Decision Trees, Gradient-Boosted Trees, and Random Forests are among the most commonly used learning methods in Spark MLlib. As datasets grow, there is a pressing need to model high-dimensional data and use highly expressive (i.e., deep) trees. However, most of the Decision Tree code in MLlib uses optimizations borrowed from Google’s PLANET framework, which scales poorly as data dimensionality and tree depth grow. Yggdrasil is a new distributed tree learning algorithm implemented in Spark that scales well to high-dimensional data and deep trees. Unlike PLANET, Yggdrasil partitions the training data vertically (by column) rather than horizontally (by row), leading to substantially lower communication costs. In our evaluation, we found that, for a single tree, Yggdrasil outperforms Spark MLlib’s standard Decision Tree algorithm by 13x on a large dataset (2 million rows, 3,500 features) from a leading Web company. Yggdrasil is open-source, and we plan to publish it as a Spark package to let users take advantage of this improved performance.
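The core idea of vertical partitioning can be sketched with a toy example. This is a hedged illustration, not Yggdrasil's actual code: it assumes a single-machine simulation where each "worker" owns one full feature column, uses sum-of-squared-errors as the split criterion, and a driver reduces over the per-feature candidates. The point it shows is that each worker finds its locally best split and communicates only a small (feature, threshold, score) tuple, so communication scales with the number of features rather than the number of rows.

```python
# Toy sketch of column-wise (vertical) split finding, in the spirit of
# Yggdrasil. All names and the SSE criterion are illustrative assumptions.

def sse(ys):
    """Sum of squared errors around the mean (regression impurity)."""
    m = sum(ys) / len(ys)
    return sum((y - m) ** 2 for y in ys)

def best_split_for_feature(values, labels):
    """One 'worker' owns one full column. It scans sorted values and
    returns only the locally best (threshold, score) pair; no per-row
    statistics ever leave the worker."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    best = (None, float("inf"))
    for k in range(1, len(order)):
        left = [labels[i] for i in order[:k]]
        right = [labels[i] for i in order[k:]]
        score = sse(left) + sse(right)
        thr = (values[order[k - 1]] + values[order[k]]) / 2
        if score < best[1]:
            best = (thr, score)
    return best

# Rows of (feature0, feature1), with one label per row.
rows = [(1.0, 10.0), (2.0, 20.0), (3.0, 30.0), (4.0, 40.0)]
labels = [0.0, 0.0, 1.0, 1.0]

# Vertical partitioning: transpose so each worker owns one column.
columns = list(zip(*rows))

# Each worker computes its candidate; the driver reduces to the global
# best. Communication is O(#features), independent of the row count.
candidates = [(j, *best_split_for_feature(list(col), labels))
              for j, col in enumerate(columns)]
feature, threshold, score = min(candidates, key=lambda c: c[2])
print(feature, threshold)  # -> 0 2.5 (splits the labels perfectly)
```

Under row-wise (PLANET-style) partitioning, each worker would instead hold a slice of rows and ship per-bin label statistics for every feature to the driver, which is the communication cost that grows with dimensionality and tree depth.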
Firas is a first-year PhD student at MIT, working with Matei Zaharia and Samuel Madden on problems related to large-scale, distributed machine learning. Previously, as a Master’s student at Stanford, he worked with Christopher Ré on the Caffe Con Troll project, an optimizer for neural networks that was featured at the DanaC Workshop at SIGMOD/PODS 2015.