Spark Summit 2013 brought the Apache Spark community together on December 2-3, 2013 at the Hotel Nikko in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Yahoo’s Audience Expansion (AEX) uses advanced custom Perform A-Like (PAL) modeling to find new, incremental users who exhibit online behavior that is similar to an advertiser’s existing customers or converters. Over the years, Yahoo has developed a sophisticated AEX pipeline based on Hadoop streaming, and refreshes audience models periodically.
In this talk, we will present our recent effort to migrate the AEX pipeline from Hadoop streaming to Spark. We aim to refresh audience models at least 2x faster. We came up with an innovative migration solution that requires no code changes at all.
We’ll present some Spark scalability enhancements that enable over 50k partitions to process 3TB of data without mapper memory issues or shuffle IO issues. Our Spark enhancements reduce memory consumption in the reducer by 30x with minimal compression overhead. We have also improved Spark to tolerate failures better, and identified several useful performance tuning tricks for improving resource utilization (CPU, memory and IO).