San Francisco
June 30 - July 2, 2014


Spark Summit 2015
SILK: A Spark Based Data Pipeline to Construct a Reliable and Accurate Food Dataset
Hesamoddin Salehian (MyFitnessPal)

The quality of food data is critical to the successful health tracking of MyFitnessPal’s 75M+ users. Because MFP’s food data is crowd-sourced, it contains inconsistencies. In this talk, I will describe a scalable, high-throughput data cleaning pipeline based on Spark. Its implementation combines massive food and user log transaction data to construct a highly accurate and clean food dataset. Our Spark-based implementation allows quick iteration and high elasticity by leveraging Spark’s extensibility and flexibility. In addition, it facilitates extension to more complex data fusion techniques, such as link analysis, using the built-in link analysis functionality of GraphX.

Hesamoddin Salehian was born in 1987 in Tehran, Iran. He received his Bachelor of Science in Computer Engineering from Sharif University of Technology, Iran, in 2010. He earned his Master of Science in Computer Engineering from the University of Florida, Gainesville, in September 2014, and his Doctor of Philosophy in Computer Engineering from the University of Florida in December 2014. He has been working at MyFitnessPal Inc. as a Data Scientist since January 2015. His research interests revolve around Data Science, Machine Learning, and Computer Vision.

Slides PDF | Video