Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014, at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used because they are easy to interpret, handle categorical variables, extend to the multi-class classification setting, do not require feature scaling, and can capture non-linearities and feature interactions. Spark RDDs are well suited to efficient computation over narrow dependencies and to multiple passes over the data, especially with in-memory caching, which makes Spark a natural fit for implementing decision trees.
In this talk, we will describe the MLlib implementation of tree algorithms, which can handle massive datasets. We will describe the challenges of implementing a distributed decision tree, including repeated histogram computations on various subsets of the data defined by the branching filters. We will also demonstrate the performance of our implementation via weak and strong scaling results for multiple datasets and cluster sizes. Finally, we will demonstrate how the decision tree implementation can be used as a building block for ensemble methods like boosting and random forests, which are considered top performers for both classification and regression tasks.
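As a rough illustration of the histogram computation mentioned above, the following sketch bins one feature, counts (bin, label) pairs separately on each data partition, merges the per-partition counts, and picks a split threshold from the merged histogram alone. It is a minimal single-machine sketch in plain Python, not the MLlib implementation: the function names, the candidate thresholds, the toy data, and the choice of Gini impurity as the split criterion are all assumptions made for the example.

```python
from collections import Counter

def partition_histogram(rows, thresholds):
    """Count (bin, label) pairs for one partition of (feature_value, label) rows."""
    hist = Counter()
    for value, label in rows:
        # Bin index = number of (sorted) thresholds strictly below the value.
        bin_idx = sum(value > t for t in thresholds)
        hist[(bin_idx, label)] += 1
    return hist

def gini(label_counts):
    """Gini impurity of a label-count mapping (0.0 for an empty or pure set)."""
    total = sum(label_counts.values())
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in label_counts.values())

def best_split(hist, thresholds):
    """Return (threshold, weighted_impurity) with the lowest weighted Gini impurity."""
    total = sum(hist.values())
    best = None
    for i, t in enumerate(thresholds):
        left, right = Counter(), Counter()
        for (bin_idx, label), count in hist.items():
            # Bins 0..i hold exactly the values <= thresholds[i].
            (left if bin_idx <= i else right)[label] += count
        n_left, n_right = sum(left.values()), sum(right.values())
        if n_left == 0 or n_right == 0:
            continue  # a split that leaves one side empty is not useful
        impurity = (n_left * gini(left) + n_right * gini(right)) / total
        if best is None or impurity < best[1]:
            best = (t, impurity)
    return best

# Hypothetical toy data: two partitions of (feature_value, label) rows.
partitions = [
    [(1.0, 0), (3.0, 0), (5.0, 1)],
    [(2.5, 0), (6.5, 1), (7.0, 1)],
]
thresholds = [2.0, 4.0, 6.0]  # assumed candidate split points, sorted ascending

# Each partition is summarized independently; only the small histograms are merged.
merged = Counter()
for rows in partitions:
    merged.update(partition_histogram(rows, thresholds))

print(best_split(merged, thresholds))
```

Only the per-partition histograms need to be combined to choose a split, which is why the raw rows never have to be gathered in one place. When a node is split, the same computation would be repeated on the subset of rows that satisfies each branch's filter.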
Manish Amde is a software engineer at Origami Logic developing machine learning and information retrieval algorithms for their marketing intelligence platform. Prior to Origami Logic, he worked at two other startups focusing on large-scale signal processing problems. His past research spans multiple fields in electrical and computer engineering and has led to several papers, patents and a book chapter. He holds a bachelor’s degree from IIT Bombay and received his doctorate from UC San Diego.
Hirakendu Das is currently a research scientist at Yahoo Labs. His research centers on the development and application of scalable machine learning algorithms for advertising systems and web data analytics at large. Prior to joining Yahoo, he received a Ph.D. degree from the University of California, San Diego, and his B.Tech. degree from the Indian Institute of Technology Madras.
Evan Sparks is a PhD student in the Computer Science Division in the UC Berkeley AMPLab. His research focuses on the design and implementation of distributed systems for large-scale data analysis and machine learning. Prior to Berkeley, he spent several years in industry tackling large-scale data problems as a Quantitative Financial Analyst at MDT Advisers and as a Product Engineer at Recorded Future. He holds a bachelor’s degree from Dartmouth College.
Ameet Talwalkar is an NSF post-doctoral fellow at UC Berkeley and a consultant at Databricks. His research addresses scalability and ease-of-use issues in the field of statistical machine learning, with applications related to large-scale genomic sequencing. He started the MLlib project in Apache Spark and is also a co-author of the graduate-level textbook entitled “Foundations of Machine Learning” (2012, MIT Press). Next year he will join UCLA’s Computer Science Department as an Assistant Professor.