Migrating Complex Data Aggregation from Hadoop to Spark

Slides PDF Video

This talk discusses our experience moving from Hadoop MapReduce (MR) to Spark. Our initial implementation used a multi-stage aggregation framework in Hadoop MR to join, de-duplicate, and group 12 TB of incoming data every 3 hours. An additional requirement was to join other heterogeneous data sources and to implement algorithms such as HyperLogLog. Hadoop MR cluster size and cost grew steeply with scale for these use cases, so we evaluated Spark. A Spark 1.1 cluster ran the aggregation pipeline within the SLA limits using 60% fewer nodes, a 30% cost saving compared to Hadoop MR. The HyperLogLog implementation ran 5-8x faster than Hadoop MR on the same number of nodes. Optimization and tuning were necessary for serialization (Kryo), parallelism, partitioning, compression, batch size, and memory assignments. The Spark extension provided by Cascading was used to migrate the code to Spark.
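As a rough sketch of the tuning knobs the talk covers, a Spark configuration along these lines touches serialization, parallelism, compression, and memory. The property names are standard Spark settings; the values are illustrative placeholders, not the ones used in this pipeline:

```
# Illustrative Spark 1.x settings (values are placeholders, not the talk's)
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.mb   64
spark.default.parallelism        400
spark.io.compression.codec       snappy
spark.executor.memory            8g
```

Partitioning and batch size are workload-specific, so they are typically tuned in code (e.g., choosing the number of partitions per stage) rather than in a global config file.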
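The talk does not show its HyperLogLog code, but the core idea is to hash each item and track, per register, the longest run of leading zero bits seen, then combine the registers into a cardinality estimate. A toy single-machine sketch in Python (an illustration of the technique only, not the talk's production implementation):

```python
import hashlib
import math

class HyperLogLog:
    """Toy HyperLogLog cardinality estimator (illustrative only)."""

    def __init__(self, p=12):
        self.p = p                  # number of index bits
        self.m = 1 << p             # number of registers
        self.registers = [0] * self.m

    def _hash(self, item):
        # 64-bit hash derived from MD5; any well-mixed hash works
        return int.from_bytes(hashlib.md5(str(item).encode()).digest()[:8], "big")

    def add(self, item):
        x = self._hash(item)
        idx = x >> (64 - self.p)                  # first p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)     # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def estimate(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            raw = self.m * math.log(self.m / zeros)
        return raw
```

With p=12 (4096 registers) the typical relative error is about 1.04/sqrt(m), roughly 1.6%, which is the exactness-for-speed trade-off that makes the 5-8x speedup over exact counting in Hadoop MR possible. Spark itself also exposes a comparable estimator via `RDD.countApproxDistinct`.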


About Ashish

Ashish is currently working as a Senior Software Engineer on the Data Team at PubMatic. Most recently he has been upgrading audience analytics to the Spark platform for efficient reporting. Ashish is interested in building large-scale aggregation and analytics platforms and enjoys experimenting with probabilistic data structures to solve complex compute problems.


About Puneet

Puneet Kumar is a Data Architect at PubMatic Inc., responsible for schema design and data pipelines. Previously he was a Lead Developer and ETL Architect at Amdocs.