Real Time Fuzzy Matching with Spark and Elastic Search


Near-duplicate records are a major concern in data analysis. They hamper our ability to integrate data from multiple sources, perform credit checks, identify the best leads by matching internal data with external data, adhere to compliance rules, and build a holistic view of systems. This talk discusses how we leverage Apache Spark, machine learning, and Elastic Search to provide real-time fuzzy matching. Our application integrates Spark with Elastic Search so that a user can query a record and find other records in the system that are identical or nearly identical to it. I will discuss how we create and use labeled data to learn similarity models, and how we integrate Spark and Elastic Search to build indices that are queried in real time to find the best-matching records.
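
The abstract gives no implementation details, so the following is only a rough sketch of the general "index, then score" pattern it describes, not the speaker's actual system. It uses plain Python with no Spark or Elastic Search dependency: an in-memory inverted index of character trigrams stands in for the search index, and a fixed Jaccard threshold stands in for a similarity model learned from labeled data. The names `trigrams`, `jaccard`, `TrigramIndex`, and the sample records are hypothetical.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of character trigrams of a lowercased, padded string."""
    s = "  " + " ".join(text.lower().split()) + " "
    return {s[i:i + 3] for i in range(len(s) - 2)}


def jaccard(a, b):
    """Jaccard similarity of the trigram sets of two strings (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


class TrigramIndex:
    """In-memory stand-in for a search index (assumption: not Elastic Search)."""

    def __init__(self):
        self.postings = defaultdict(set)  # trigram -> ids of records containing it
        self.records = {}                 # record id -> original text

    def add(self, record_id, text):
        self.records[record_id] = text
        for gram in trigrams(text):
            self.postings[gram].add(record_id)

    def query(self, text, threshold=0.5):
        """Return (score, record_id) pairs at or above threshold, best first."""
        # Step 1: cheap candidate retrieval, any record sharing a trigram.
        candidates = set()
        for gram in trigrams(text):
            candidates |= self.postings.get(gram, set())
        # Step 2: score only the candidates. The fixed threshold is a
        # placeholder for a model trained on labeled match/non-match pairs.
        scored = ((jaccard(text, self.records[rid]), rid) for rid in candidates)
        return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


if __name__ == "__main__":
    index = TrigramIndex()
    index.add(1, "Jon Smith, 12 Main Street")
    index.add(2, "John Smith, 12 Main St")
    index.add(3, "Alice Wong, 5 Oak Avenue")
    for score, rid in index.query("John Smith, 12 Main Street"):
        print(rid, round(score, 3), index.records[rid])
```

The two-step design keeps queries fast: the index narrows the search to records that share at least one trigram with the query, so the more expensive pairwise comparison runs on a small candidate set instead of every record in the system.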

Photo of Sonal Goyal

About Sonal

Sonal is the founder and CEO of Nube Technologies, a startup that makes tools for big data wrangling. Nube’s product, Reifier, is built on Apache Spark and uses machine learning to fuzzily match nearly similar entities and records. Sonal is the lead architect and developer of Reifier, which enterprises use to find duplicates, cleanse records, and achieve a 360-degree view of their data. Previously, she open sourced HIHO for Hadoop ETL and Crux reporting for HBase on GitHub. Besides high tech, she enjoys reading and crafting. Sonal holds a BTech from IIT Delhi.