Apache Spark 2.2 ships with a state-of-the-art cost-based optimization framework that collects and leverages a variety of per-column data statistics (e.g., cardinality, number of distinct values, number of NULL values, min/max, average/maximum length) to improve the quality of query execution plans. With reliable statistics, Spark can make better decisions when choosing a query plan. Examples of these optimizations include selecting the correct build side in a hash join, choosing the right join type (broadcast hash join vs. shuffled hash join), and adjusting the order of a multi-way join. In this talk, we’ll take a deep dive into Spark’s cost-based optimizer and discuss how these statistics are collected and stored, the query optimizations they enable, and their performance impact on TPC-DS benchmark queries.
Talk contributors: Zhenhua Wang (Huawei Technologies) and Wenchen Fan (Databricks)
Session hashtag: #SFdd2
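As a rough illustration of the join decisions described in the abstract, here is a minimal Python sketch of how size estimates could drive the choice of build side and join type. It is not Spark's implementation: `TableStats`, `plan_join`, the table names, and the 10 MB threshold are all assumptions made for this example, and table size is approximated simply as row count times average row width.

```python
from dataclasses import dataclass

# Assumed threshold for this sketch: build sides at or below this
# estimated size are treated as small enough to broadcast.
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024


@dataclass(frozen=True)
class TableStats:
    """Hypothetical container for the statistics a planner might keep."""
    name: str
    row_count: int
    avg_row_bytes: int

    @property
    def size_bytes(self) -> int:
        # Estimated size = number of rows * average row width.
        return self.row_count * self.avg_row_bytes


def plan_join(left: TableStats, right: TableStats,
              threshold: int = BROADCAST_THRESHOLD_BYTES):
    """Return (join_type, build_side_name) from estimated sizes only."""
    # Build the hash table on the side estimated to be smaller.
    build = left if left.size_bytes <= right.size_bytes else right
    # Broadcast the build side if it fits under the threshold;
    # otherwise fall back to a shuffled hash join.
    if build.size_bytes <= threshold:
        return "broadcast_hash_join", build.name
    return "shuffled_hash_join", build.name


if __name__ == "__main__":
    fact = TableStats("store_sales", row_count=1_000_000_000, avg_row_bytes=100)
    dim = TableStats("date_dim", row_count=73_000, avg_row_bytes=50)
    print(plan_join(fact, dim))  # ('broadcast_hash_join', 'date_dim')
```

The point of the sketch is only that both decisions depend on the size estimates: with missing or stale statistics, the same logic could pick the wrong build side or miss a broadcast opportunity.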
Zhenhua Wang is a Research Engineer at Huawei Technologies, where he works on building a big data analytics platform based on Apache Spark. Prior to joining Huawei, he received a PhD in Computer Science from Zhejiang University. His research interests include information retrieval and web data mining.