Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014, at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
Spark and MLlib are a terrific toolkit for fitting very large-scale classification, regression, collaborative filtering, and clustering models. However, taking a vague problem statement like “learn a classifier” and translating it into a working model presently requires a high degree of hand tuning and trial and error. Additionally, when models are expensive to train, shortening this process can significantly reduce the total investment in developing a model.
We present our work on the MLbase optimizer, a system designed to quickly search hyperparameter space to find a good model without manual effort from the user. Our system, built on Spark, offers a 10x speedup over naive methods for model search by leveraging performance enhancements, better search algorithms, and statistical heuristics.
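To make "model search" concrete, here is a minimal single-machine sketch contrasting a naive grid search with a simple random search over two hyperparameters. It is not the MLbase optimizer and uses no Spark or MLlib APIs. The function `toy_validation_error`, the parameter names `learning_rate` and `regularization`, and all ranges are hypothetical stand-ins for "train a model with these settings and measure its validation error."

```python
import itertools
import math
import random


def toy_validation_error(params):
    """Hypothetical stand-in for training a model and returning validation error."""
    # The toy error is smallest at learning_rate = 0.1 and regularization = 0.01.
    lr_term = (math.log10(params["learning_rate"]) + 1.0) ** 2
    reg_term = (math.log10(params["regularization"]) + 2.0) ** 2
    return lr_term + 0.1 * reg_term


def grid_search(evaluate, grid):
    """Naive baseline: evaluate every combination of the listed values."""
    names = sorted(grid)
    best_params, best_error = None, float("inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


def random_search(evaluate, ranges, budget, seed=0):
    """Sample each hyperparameter log-uniformly from (low, high) for `budget` trials."""
    rng = random.Random(seed)
    best_params, best_error = None, float("inf")
    for _ in range(budget):
        params = {
            name: 10 ** rng.uniform(math.log10(low), math.log10(high))
            for name, (low, high) in ranges.items()
        }
        error = evaluate(params)
        if error < best_error:
            best_params, best_error = params, error
    return best_params, best_error


# Illustrative search spaces (assumed values, not taken from the talk).
grid = {
    "learning_rate": [0.001, 0.01, 0.1, 1.0],
    "regularization": [0.001, 0.01, 0.1],
}
ranges = {"learning_rate": (0.001, 1.0), "regularization": (0.001, 0.1)}

print(grid_search(toy_validation_error, grid))
print(random_search(toy_validation_error, ranges, budget=12))
```

Both searches here spend the same budget of 12 evaluations. When each evaluation means training a model on a large dataset, the number of evaluations and the cost of each one dominate the total time, which is the cost a model search system aims to reduce.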
Evan Sparks is a PhD student in the Computer Science Division at the UC Berkeley AMPLab. His research focuses on the design and implementation of distributed systems for large-scale data analysis and machine learning. Prior to Berkeley, he spent several years in industry tackling large-scale data problems as a Quantitative Financial Analyst at MDT Advisers and as a Product Engineer at Recorded Future. He holds a bachelor’s degree from Dartmouth College.