Real-world machine learning applications typically consist of many components in a data processing pipeline. For example, in text classification, preprocessing steps like n-gram extraction and TF-IDF feature weighting are often necessary before training a classification model like an SVM. We describe a framework for constructing these ML pipelines and show how it helps us build end-to-end workflows from a toolbox of off-the-shelf components we have developed for text and image classification, along with a high-performance linear algebra library that we use for training models. We show that with this framework we can achieve state-of-the-art results on many machine learning tasks. Our scalable implementation on Spark outperforms supercomputing installations and can match deep learning error rates on speech recognition in less than 1 hour on EC2 for $20. Finally, we discuss research in the AMPLab to support common iterative machine learning workflows through careful resource estimation and checkpoint planning.
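To make the pipeline pattern concrete, here is a minimal, framework-agnostic sketch of the text-classification preprocessing steps mentioned above: small composable stages (n-gram extraction, TF-IDF weighting) chained into one end-to-end transformation. The function names and chaining style are illustrative assumptions for this sketch, not the framework's actual API.

```python
# Illustrative sketch only: stage names and chaining style are assumptions,
# not the actual pipeline framework's interface.
import math
from collections import Counter

def ngrams(tokens, n=2):
    """Stage 1: extract word n-grams from a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def tf_idf(docs_ngrams):
    """Stage 2: weight each document's n-gram counts by inverse
    document frequency, so corpus-wide common n-grams score lower."""
    n_docs = len(docs_ngrams)
    df = Counter()                      # document frequency per n-gram
    for doc in docs_ngrams:
        df.update(set(doc))
    weighted = []
    for doc in docs_ngrams:
        tf = Counter(doc)               # term frequency within the document
        weighted.append({g: c * math.log(n_docs / df[g])
                         for g, c in tf.items()})
    return weighted

# Chain the stages by hand, as a pipeline framework would do automatically;
# the resulting feature vectors would then feed a classifier such as an SVM.
corpus = ["the quick brown fox", "the lazy dog", "the quick dog"]
tokenized = [doc.split() for doc in corpus]
bigrams = [ngrams(t, n=2) for t in tokenized]
features = tf_idf(bigrams)
```

Bigrams unique to a single document receive a higher weight than bigrams shared across documents, which is the behavior a downstream classifier relies on.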
Evan Sparks is a PhD student in the Computer Science Division at UC Berkeley. His research focuses on the design and implementation of distributed systems for large-scale data analysis and machine learning. Prior to Berkeley he spent several years in industry tackling large-scale data problems, as a Quantitative Financial Analyst at MDT Advisers and as a Product Engineer at Recorded Future. He holds a bachelor’s degree from Dartmouth College.
Shivaram Venkataraman is a fourth-year PhD student at the University of California, Berkeley, where he works with Mike Franklin and Ion Stoica in the AMPLab. He is a committer on the Apache Spark project, and his research interests are in designing systems for large-scale machine learning. Before coming to Berkeley, he completed his M.S. at the University of Illinois at Urbana-Champaign and worked as a Software Engineer at Google.