Best Practices for Running PySpark

Slides PDF Video

PySpark, the component of Spark that lets users write their code in Python, has grabbed the attention of Python programmers who analyze and process data for a living. The appeal is obvious: you don’t need to learn a new language, and you still have access to modules you are familiar with (e.g., pandas, nltk, statsmodels), yet you can run complex computations quickly and at scale using the power of Spark.

The drawbacks of using Python in a distributed environment only become apparent when you try to deploy your application and run an analysis against real-world data. The reality of using PySpark is that:

* Managing dependencies and their installation on a cluster is crucial.
* Duck typing in Python can let bugs in your code slip by, only to be discovered when you run it against a large and inevitably messy data set.
* You must understand the underlying Spark computational model, particularly where and when various blocks of code get executed, in order to write applications that will work correctly when distributed across a cluster.

In this talk, we will examine a real PySpark job that runs a statistical analysis of time series data to motivate the issues described above and provide a concrete example of best practices for real-world PySpark applications. We will cover:

* Python package management on a cluster using virtualenv.
* Testing PySpark applications.
* Spark’s computational model and its relationship to how you structure your code.
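As a rough illustration of the testing and duck-typing points in the abstract, one common approach is to keep per-record logic in a plain Python function that can be unit-tested locally, without a cluster, before it is handed to a Spark transformation such as `rdd.map`. The sketch below is not taken from the talk: the function name `parse_reading` and the `sensor_id,value` record format are hypothetical, chosen only to show how malformed input can be handled explicitly instead of surfacing as a failure partway through a large job.

```python
from typing import Optional, Tuple


def parse_reading(line: str) -> Optional[Tuple[str, float]]:
    """Parse a 'sensor_id,value' line (hypothetical format).

    Returns (sensor_id, value) on success and None for malformed input,
    so that bad records can be filtered out rather than crashing a task.
    """
    parts = line.split(",")
    if len(parts) != 2:
        return None
    sensor_id, raw_value = parts[0].strip(), parts[1].strip()
    if not sensor_id:
        return None
    try:
        return sensor_id, float(raw_value)
    except ValueError:
        # The value field was not numeric, e.g. "n/a" in messy data.
        return None


def test_parse_reading() -> None:
    # Well-formed record, with stray whitespace around the value.
    assert parse_reading("a1, 3.5") == ("a1", 3.5)
    # Malformed records are reported as None instead of raising.
    assert parse_reading("a1,not-a-number") is None
    assert parse_reading("missing-value") is None
    assert parse_reading(",2.0") is None


test_parse_reading()
```

Because `parse_reading` depends only on the standard library, the same function can be exercised in an ordinary test runner and then applied to distributed data, for example by mapping it over an RDD of lines and filtering out the `None` results.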

Photo of Juliet Hougland

About Juliet

Juliet is a Data Scientist at Cloudera and a contributor/committer/maintainer for the Sparkling Pandas project. Her commercial applications of data science include developing predictive maintenance models for oil & gas pipelines at Deep Signal, and designing and building a platform for real-time model application, data storage, and model building at WibiData. Juliet was the technical editor for Learning Spark by Karau et al. and Advanced Analytics with Spark by Ryza et al. She holds an MS in Applied Mathematics from the University of Colorado, Boulder, and graduated Phi Beta Kappa from Reed College with a BA in Math-Physics.