This talk will focus on the practical details of building a recommendation engine on top of Spark MLlib’s ALS collaborative-filtering algorithm that can reliably generate predictions for 25 million users across a space of 5 million products. The unique aspect of this work is two-fold. First, we are able to generate scores for every combination of user and product (125 trillion possible values) on a small 6-node cluster. Second, careful optimization provides several orders of magnitude of improvement over MLlib’s prediction step, with performance scaling linearly as more cores are added to the system. The primary goal is to present the optimizations and parameter tuning necessary to achieve these gains, coupled with a discussion of the Spark internals that come into play. The talk is tailored for the intermediate Spark developer who wishes to understand the trickier aspects of Spark and how they affect both stability and performance.
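To make the scale of the prediction step concrete, here is a minimal, hypothetical sketch of the kind of blocked scoring that makes exhaustive user × product prediction tractable: after ALS training you hold a user-factor matrix and an item-factor matrix, and scoring every pair is just their matrix product, computed one block of users at a time so each block is a single BLAS-backed multiply rather than millions of per-pair dot products. This is plain NumPy with random stand-in factors, not the talk’s actual Spark pipeline; the names, sizes, and `top_k_per_block` helper are all illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
rank = 10                      # ALS latent-factor rank (illustrative)
n_users, n_items, k = 1000, 500, 5

U = rng.standard_normal((n_users, rank))   # stand-in for ALS user factors
P = rng.standard_normal((n_items, rank))   # stand-in for ALS item factors

def top_k_per_block(user_block, item_factors, k):
    """Score one block of users against every item with a single matrix
    multiply, then keep only each user's top-k item indices."""
    scores = user_block @ item_factors.T               # (block, n_items) gemm
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]  # k best, unordered
    # sort the k survivors so recommendations come out highest-score first
    order = np.take_along_axis(scores, top, axis=1).argsort(axis=1)[:, ::-1]
    return np.take_along_axis(top, order, axis=1)

block_size = 200
recs = np.vstack([
    top_k_per_block(U[i:i + block_size], P, k)
    for i in range(0, n_users, block_size)
])
print(recs.shape)  # (1000, 5): top-5 item indices per user
```

In a real Spark deployment the item-factor matrix would typically be broadcast to the executors and the user factors streamed through in partitions, but the core idea is the same: replace per-pair scoring with dense block matrix multiplication.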
I’m an electrical engineer turned data scientist. I started out doing robotics, then did a stint at the University of Michigan studying how to make robots that can learn and develop the way human children do. This venture into developmental robotics got me interested in machine learning, and after a few years doing DSP work with cellular phones and software-defined radios, I’m now doing what I’ve wanted to do for years: big data :). I’m delving deep into the world of distributed computing and the architectures that support it, looking to create a powerful new recommender system.