Apache Spark 2.2 ships with a state-of-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, NULL values, max/min, avg/max length, etc.) to improve the quality of query execution plans. Leveraging these reliable statistics helps Spark to make better decisions in picking the most optimal query plan. Examples of these optimizations include selecting the correct build side in a hash-join, choosing the right join type (broadcast hash-join vs. shuffled hash-join) or adjusting a multi-way join order, among others. In this talk, we’ll take a deep dive into Spark’s cost based optimizer and discuss how we collect/store these statistics, the query optimizations it enables, and its performance impact on TPC-DS benchmark queries. Talk contributors: Zhenhua Wang (Huawei Technologies) and Wenchen Fan (Databricks) Session hashtag: #SFdd2
Ron Hu is a Database System Architect at Huawei Technologies where he works on building a big data analytics platform based on Apache Spark. Prior to joining Huawei, he used to work at Teradata, Sybase, and MarkLogic with a focus on parallel database systems and search engine. Ron holds a PhD in Computer Science from University of California, Los Angeles.
Sameer Agarwal is a Software Engineer at Databricks working on Spark core and Spark SQL. Previously, he received his PhD in Databases from UC Berkeley AMPLab where he worked on BlinkDB, an approximate query engine for Spark.