Entity resolution techniques automate the process of mapping different representations of the same entity to the entity they represent. In this work we build a distributed entity-resolution framework based on distributed graph analytics. In this framework, different representations of entities are modeled as the nodes of a graph, and the relations between these representations (such as string similarities and contextual associations) are modeled as its edges. The framework uses message passing to find paths between nodes, from which it estimates the likelihood that two nodes represent the same entity.

As a use case, we apply this framework to clean up our Whois data set. To put this in perspective, values such as “United States of America”, “United States”, “USA”, “US”, and “US of A”, which clearly represent the same country, all appear in the “Registrant Country” field of the Whois database. The goal is to resolve all such values to the country they represent using entity resolution techniques. The Whois data set is a few terabytes in size and contains a few hundred million records, so we need distributed parallel processing systems to implement this framework. Using entity resolution techniques we build a graph with hundreds of millions of nodes and over a billion edges. The framework uses Spark map-reduce operations for processing the data and building the nodes and edges, and it uses GraphX to build the distributed graph and run our graph traversal algorithms on it. Applying the framework to this use case yields a significantly cleaner set of values for the fields of our Whois data set, and fully normalized values for fields such as “Registrant Country”.
Moreover, thanks to the capabilities the Spark framework provides, we can conveniently implement the entire framework in one environment, run the code on data cached in memory, and get results in a few minutes.
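To illustrate the core idea, the following is a minimal, single-machine sketch in Python: variant strings are nodes, similarity relations are edges, and grouping nodes that are connected by paths resolves each variant to one canonical entity. The sample values, edge list, and the longest-string canonicalization rule are illustrative assumptions; the actual framework performs this at scale with Spark and GraphX.

```python
from collections import defaultdict, deque


def connected_components(nodes, edges):
    """Group nodes into components via breadth-first traversal."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, components = set(), []
    for node in nodes:
        if node in seen:
            continue
        comp, queue = [], deque([node])
        seen.add(node)
        while queue:
            cur = queue.popleft()
            comp.append(cur)
            for nxt in adj[cur]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(comp)
    return components


# Illustrative "Registrant Country" variants (assumed sample data).
nodes = ["United States of America", "United States", "USA", "US",
         "Germany", "DE"]
# In the framework, edges come from string similarities and contextual
# associations; here they are hand-picked for the sketch.
edges = [
    ("United States of America", "United States"),
    ("United States", "USA"),
    ("USA", "US"),
    ("Germany", "DE"),
]

# Resolve every variant to a canonical value; as a stand-in rule we pick
# the longest string in each component.
resolved = {}
for comp in connected_components(nodes, edges):
    canonical = max(comp, key=len)
    for variant in comp:
        resolved[variant] = canonical

print(resolved["US"])  # resolves to the same canonical value as "USA"
```

In the distributed setting the same traversal is expressed with GraphX message passing rather than an in-memory queue, but the resolution logic is analogous.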
Mahdi Namazifar is currently a Data Scientist at Cisco Systems’ San Francisco Innovation Center (SFIC). He received his PhD in Operations Research from the University of Wisconsin-Madison in 2011. During his PhD studies he was also affiliated with the Wisconsin Institute for Discovery (WID) and the French Institute for Research in Computer Science and Automation (INRIA). He was also a National Science Foundation (NSF) grantee at the San Diego Supercomputer Center in 2007 and a research intern at the IBM T.J. Watson Research Lab in 2008. After graduate school and before his current position at Cisco, he was a Scientist at Opera Solutions.