Spark Summit 2014 brought the Apache Spark community together on June 30 to July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Random Forest is a popular workhorse of machine learning. However, the individual decision trees that compose a Random Forest are difficult to train on “big” data because (1) decision trees are non-parametric models, so their complexity and size tend to grow with the size of the data, and (2) in Random Forests, individual trees are usually left unpruned and unregularized in order to minimize their bias. Here, we present Sequoia Forest, a Spark-based distributed implementation of Random Forest. We show that a Sequoia Forest consisting of hundreds of trees with millions of nodes can be trained on billions of rows with thousands of features within a reasonable amount of time. Additionally, to justify building fully-grown trees on “big” data, we show that Sequoia Forest usually outperforms a Random Forest built from pruned trees or from trees trained on smaller random sub-samples.
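To make the two points above concrete, the core Random Forest recipe the abstract relies on (bootstrap sampling, per-node random feature selection, and unpruned trees grown to purity) can be sketched in plain Python. This is a minimal single-machine illustration of the general algorithm, not Sequoia Forest's distributed implementation; all names here are hypothetical.

```python
import random
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def grow_tree(X, y, n_feat):
    """Grow an unpruned axis-aligned decision tree until each leaf is pure.

    Because there is no pruning or depth limit, the number of nodes tends to
    grow with the number of training rows -- the non-parametric behavior the
    abstract points to.
    """
    if len(set(y)) == 1:
        return y[0]                       # pure leaf: return the class label
    best = None
    # Random Forest twist: consider only a random subset of features per node.
    for d in random.sample(range(len(X[0])), n_feat):
        for t in sorted({x[d] for x in X})[1:]:
            left = [i for i, x in enumerate(X) if x[d] < t]
            right = [i for i, x in enumerate(X) if x[d] >= t]
            score = (len(left) * gini([y[i] for i in left]) +
                     len(right) * gini([y[i] for i in right])) / len(y)
            if best is None or score < best[0]:
                best = (score, d, t, left, right)
    if best is None:                      # no valid split: majority-vote leaf
        return Counter(y).most_common(1)[0][0]
    _, d, t, left, right = best
    return (d, t,
            grow_tree([X[i] for i in left], [y[i] for i in left], n_feat),
            grow_tree([X[i] for i in right], [y[i] for i in right], n_feat))

def predict(tree, x):
    """Route a point down one tree to its leaf label."""
    while isinstance(tree, tuple):
        d, t, left, right = tree
        tree = left if x[d] < t else right
    return tree

def random_forest(X, y, n_trees=25, seed=0):
    """Train n_trees unpruned trees, each on a bootstrap sample of the rows."""
    random.seed(seed)
    n_feat = max(1, int(len(X[0]) ** 0.5))   # common sqrt(d) heuristic
    forest = []
    for _ in range(n_trees):
        idx = [random.randrange(len(X)) for _ in range(len(X))]  # bootstrap
        forest.append(grow_tree([X[i] for i in idx],
                                [y[i] for i in idx], n_feat))
    return forest

def forest_predict(forest, x):
    """Majority vote across all trees."""
    return Counter(predict(t, x) for t in forest).most_common(1)[0][0]

# Tiny demo: two well-separated 2-D clusters.
X = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (5, 6), (6, 5), (6, 6)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = random_forest(X, y)
print(forest_predict(forest, (0.5, 0.5)), forest_predict(forest, (5.5, 5.5)))
```

A distributed implementation such as Sequoia Forest must handle the same recipe when neither the data nor a single tree fits in one machine's memory, which is why the trees' unbounded growth makes the problem hard at scale.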
Sung Hwan is a software engineer and data scientist who has worked on machine learning problems in e-commerce, search engines, computer vision, and cloud systems. He currently works at Alpine Data Labs and previously worked at eBay and Microsoft. He holds patents in computer vision and e-commerce systems and has a master’s degree from Stanford University.