Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014, at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
Prior to v1.0, MLlib supported only dense data in regression, classification, and clustering, even though sparse data dominates in practice. In this talk, we will show the design choices we’ve made to support sparse data in MLlib and the optimizations we used to take advantage of sparsity in k-means, gradient descent, column summary statistics, tall-and-skinny SVD, PCA, and other algorithms.
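To make "taking advantage of sparsity" concrete, here is a minimal sketch in plain Python. It is not MLlib's code, and every name in it (`SparseRow`, `column_means`, `sparse_dot`, `squared_distance`) is hypothetical. It assumes each row is stored as a pair of sorted indices and their nonzero values, and shows two ideas of the kind the abstract mentions: column statistics that touch only stored entries, and a squared distance for k-means computed from cached norms plus a sparse dot product.

```python
from typing import List, Tuple

# Hypothetical representation: (sorted column indices, matching nonzero values).
SparseRow = Tuple[List[int], List[float]]


def column_means(rows: List[SparseRow], num_cols: int) -> List[float]:
    """Column means computed by visiting only the stored (nonzero) entries."""
    if not rows:
        raise ValueError("need at least one row")
    sums = [0.0] * num_cols
    for indices, values in rows:
        for j, v in zip(indices, values):
            sums[j] += v
    # Implicit zeros contribute nothing to the sum but still count as rows.
    return [s / len(rows) for s in sums]


def sparse_dot(a: SparseRow, b: SparseRow) -> float:
    """Dot product via a merge over the two sorted index lists."""
    (ai, av), (bi, bv) = a, b
    i = j = 0
    total = 0.0
    while i < len(ai) and j < len(bi):
        if ai[i] == bi[j]:
            total += av[i] * bv[j]
            i += 1
            j += 1
        elif ai[i] < bi[j]:
            i += 1
        else:
            j += 1
    return total


def squared_distance(a: SparseRow, b: SparseRow,
                     a_norm_sq: float, b_norm_sq: float) -> float:
    """||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b, with the norms precomputed."""
    return a_norm_sq + b_norm_sq - 2.0 * sparse_dot(a, b)


if __name__ == "__main__":
    rows = [([0, 2], [1.0, 3.0]), ([2], [5.0]), ([], [])]
    print(column_means(rows, num_cols=3))
    print(squared_distance(rows[0], rows[1], a_norm_sq=10.0, b_norm_sq=25.0))
```

In both functions the cost scales with the number of stored entries rather than with the full dimension, which is the point of a sparse representation. The norm-based distance can lose precision when the two vectors are nearly identical, so a production implementation would likely need a fallback for that case.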
Xiangrui Meng is a software engineer at Databricks. He has been actively involved in the development of Spark MLlib since he joined. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce.