Efficient data access is one of the key factors in a high-performance data processing pipeline. How data values are laid out in the filesystem often has a fundamental impact on data access performance. In this talk, we will share insights into how data layout affects that performance. We will first explain how modern columnar file formats such as Parquet and ORC work and how to use them efficiently to store data values. Then, we will present our best practices for storing datasets, including guidelines on choosing partitioning columns and deciding how to bucket a table.
Session hashtag: #SFexp20
Cheng first came into contact with Spark in late 2013 and joined Databricks in early 2014 as one of the main developers behind Spark SQL. He is now a committer on Apache Spark and Apache Parquet. His current areas of interest include databases and programming languages.
Vida is currently a Solutions Engineer at Databricks, where she onboards and supports customers using Spark on Databricks Cloud. Previously, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google, where she improved search rankings of mobile-specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.