Spark Summit 2014 brought the Apache Spark community together on June 30 to July 2, 2014, at the Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
At Euclid, we are making the physical world just as machine readable, trackable, and actionable as cookies and click-throughs have made the online retail world. To do this, we process logs from sensors around the globe to understand the behaviors of people and their interactions with physical retail locations. This challenging task requires us to model each user’s behavior at a device level, meaning that we design, train, and deploy thousands of machine learning models daily. We have recently introduced Spark into the core of our analytics stack. Doing so has given us greater flexibility in our analysis and has improved accuracy, reporting, and testing. We intend to discuss two topics: the technological and programmatic aspects of switching to a strongly typed system for our small engineering team, and the continued challenges we face in deploying daily-tuned random forest and naive Bayes models at scale.
David Strauss received a BA in Physics from Dartmouth College (2008) and an MA (2010) and PhD (2013) in Electrical Engineering from Stanford. His research has focused on large-scale modeling, optimization, and machine learning in radio environments. Since 2013, he has worked at Euclid Analytics as a data scientist and engineer. For several years he has been an advocate and user of the Spark project for data processing and prediction.