San Francisco
June 30 - July 2, 2014

Spark Summit 2014 brought the Apache Spark community together on June 30 - July 2, 2014 at the Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.


Spark Summit 2014
Fuzzy matching with Spark
Sonal Goyal (Nube Technologies)

Business data comes with a lot of noise. To model and analyze vast, ever-growing volumes of data effectively, we need tools that link and group similar entities. In this talk, we will discuss how we have used Spark’s machine learning, distributed, and in-memory capabilities to create a fuzzy matching engine that learns from given samples of similar records and applies that knowledge to cleanse, deduplicate, and link records.
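
To make the idea of learning from sample records concrete, here is a minimal single-machine sketch in plain Python. It is not the engine described in the talk and does not use Spark: the string similarity measure (the standard library's difflib), the single learned threshold, and the function names learn_threshold and deduplicate are all assumptions chosen for illustration. It learns a similarity cut-off from a few labeled pairs, then groups records that reach that cut-off.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1] (illustrative measure)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def learn_threshold(labeled_pairs):
    """Pick the cut-off that classifies the most labeled sample pairs correctly.

    labeled_pairs holds (record_a, record_b, is_match) tuples.
    """
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    candidates = sorted({score for score, _ in scored})

    def correct(threshold):
        return sum((score >= threshold) == is_match for score, is_match in scored)

    return max(candidates, key=correct)


def deduplicate(records, threshold):
    """Group records linked by a pairwise similarity at or above the threshold."""
    parent = list(range(len(records)))  # union-find forest over record indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Hypothetical simplification: compare every pair, which is quadratic.
    for i, j in combinations(range(len(records)), 2):
        if similarity(records[i], records[j]) >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i, record in enumerate(records):
        groups.setdefault(find(i), []).append(record)
    return list(groups.values())


if __name__ == "__main__":
    samples = [
        ("Acme Corp", "ACME Corp.", True),
        ("Jon Smith", "John Smith", True),
        ("Acme Corp", "Globex Inc", False),
        ("Jon Smith", "Mary Jones", False),
    ]
    cutoff = learn_threshold(samples)
    print(deduplicate(["Acme Corp", "ACME Corp.", "Globex Inc"], cutoff))
```

Comparing every pair of records does not scale to large datasets, which is one reason a distributed, in-memory platform is attractive for this kind of workload.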

Sonal is the founder of Nube Technologies and Lead Architect of Reifier, a fuzzy matching and entity resolution engine for big data. She has been working on big data and distributed technologies, and previously open-sourced Crux Reporting for HBase and the HIHO Hadoop connector on GitHub. Outside of technology, she likes to read and spend time with her family.

Slides PDF | Video