SESSION

Locality Sensitive Hashing By Spark


Locality Sensitive Hashing (LSH) is a randomized algorithm for solving the near-neighbor search problem in high-dimensional spaces. LSH has many applications in areas such as machine learning and information retrieval. In this talk, we will discuss why and how we use LSH at Uber, then dive deep into the technical details of our LSH implementation. Our LSH library is designed and implemented to optimize performance on Spark. It supports pluggable distance functions; out of the box, Jaccard, Cosine, Hamming, and Euclidean distance functions are included in the library. It also supports approximate near-neighbor searches and self-similarity joins. Finally, we will share performance benchmarks and our experience of running LSH on Spark in production clusters.
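To illustrate the approximate-near-neighbor idea behind a library like this (this is a generic MinHash LSH sketch for Jaccard similarity, not Uber's actual implementation or API), one can hash each set into a short signature and bucket signatures by bands, so that similar sets collide as candidate pairs:

```python
import random

random.seed(42)
PRIME = 2_147_483_647  # large prime for the random hash family

def make_hash_funcs(num_hashes):
    """A family of random linear hash functions h(x) = (a*x + b) % PRIME."""
    return [(random.randrange(1, PRIME), random.randrange(0, PRIME))
            for _ in range(num_hashes)]

def minhash_signature(items, hash_funcs):
    """For each hash function, keep the minimum hash value over the set."""
    return [min((a * hash(x) + b) % PRIME for x in items) for a, b in hash_funcs]

def lsh_bands(signature, bands, rows):
    """Split a signature into bands; sets sharing any band become candidate pairs."""
    return [tuple(signature[i * rows:(i + 1) * rows]) for i in range(bands)]

# Two highly overlapping sets and one disjoint set.
a = set(range(0, 100))
b = set(range(5, 105))      # Jaccard(a, b) ≈ 0.9
c = set(range(1000, 1100))  # Jaccard(a, c) = 0

funcs = make_hash_funcs(64)
sig_a, sig_b, sig_c = (minhash_signature(s, funcs) for s in (a, b, c))

# The fraction of matching MinHash values estimates Jaccard similarity.
est_ab = sum(x == y for x, y in zip(sig_a, sig_b)) / 64
est_ac = sum(x == y for x, y in zip(sig_a, sig_c)) / 64

# Banding: similar pairs are very likely to share at least one band bucket.
candidate_ab = any(x == y for x, y in zip(lsh_bands(sig_a, 16, 4), lsh_bands(sig_b, 16, 4)))
candidate_ac = any(x == y for x, y in zip(lsh_bands(sig_a, 16, 4), lsh_bands(sig_c, 16, 4)))
```

In a distributed setting such as Spark, the same banding step maps naturally onto a group-by-bucket shuffle: only sets landing in the same bucket are compared exactly, which is what makes approximate near-neighbor joins tractable at scale.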

About Alain

Alain has been a part of Uber’s engineering team since it could be fed by a couple of pizzas. He implemented the first Spark-based ETL pipeline at Uber as part of the migration to a Hadoop-centric data world. Nowadays he uses Spark to analyze and classify trip GPS data in search of new product opportunities. In years prior, he worked on Uber’s realtime dispatch engine and later led the development of a realtime metrics engine powering Uber’s business dashboards. In a prior life, he tinkered with embedded systems in consumer and aerospace electronics.

Kelvin Chu, Engineer at Uber

About Kelvin

Kelvin is a founding member of the Hadoop team at Uber. He is creating tools and services on top of Spark to support multi-tenancy and large-scale, computation-intensive applications. He is the creator and lead engineer of the Spark Uber Development Kit, Paricon, and SparkPlug services, which are the main initiatives of Spark Compute at Uber. At Ooyala, he was co-creator of Spark Job Server, an open-source RESTful server for submitting, running, and managing Spark jobs, jars, and contexts. He implemented real-time video analytics engines on top of it via datacube materializations with RDDs.