Data Storage Tips for Optimal Spark Performance

Slides PDF Video

Spark can analyze data stored on files in many different formats: plain text, JSON, XML, Parquet, and more. But just because you can get a Spark job to run on a given data input format doesn’t mean you’ll get the same performance with all of them. Actually, the performance difference can be quite substantial. This talk will cover some common data input formats and nuances about working with that format. The goal for the talk is to help Spark programmers make more conscientious and smart decisions about how to store their data. Here is an example of the topics that will be covered in the talk: – Issues you’ll encounter when processing excessively large XML input files. – Why choose parquet files for Spark SQL? – How coalescing many small files may give you better performance.

Photo of Vida Ha

About Vida

Vida is currently a Solutions Engineer at Databricks where her job is to onboard and support customers using Spark on Databricks Cloud. In her past, she worked on scaling Square’s Reporting Analytics System. She first began working with distributed computing at Google – where she improved search rankings of mobile specific web content and built and tuned language models for speech recognition using a year’s worth of Google search queries. She’s passionate about accelerating the adoption of Apache Spark to bring the combination of speed and scale of data processing to the mainstream.