Improving Python and Spark Performance and Interoperability with Apache Arrow


Apache Spark has become a popular and successful way for Python programmers to parallelize and scale up data processing. In many use cases, though, a PySpark job can perform worse than an equivalent job written in Scala. It is also costly to push and pull data between the user’s Python environment and the Spark master.

Apache Arrow-based interconnection between the various big data tools (SQL, UDFs, machine learning, big data frameworks, etc.) enables you to use them together seamlessly and efficiently, without conversion overhead. When the tools are collocated on the same processing node, read-only shared memory and IPC avoid communication overhead. When they are remote, scatter-gather I/O sends the memory representation directly to the socket, avoiding serialization costs.
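The scatter-gather idea can be illustrated with Python's standard library alone. The sketch below is not Arrow's implementation: it assumes a Unix-like system (where `socket.sendmsg` and `socket.recvmsg_into` are available), uses `array` objects as stand-ins for columnar buffers, and assumes the receiver already knows each column's type and length (a real IPC format would send that metadata first). The names `send_columns`, `recv_columns`, and `_advance` are hypothetical helpers made up for this example.

```python
import socket
from array import array


def _advance(buffers, nbytes):
    """Drop buffers that were fully transferred and trim a partial one."""
    while buffers and nbytes >= len(buffers[0]):
        nbytes -= len(buffers[0])
        buffers.pop(0)
    if buffers and nbytes:
        buffers[0] = buffers[0][nbytes:]


def send_columns(sock, columns):
    """Gather-write: hand the raw column memory to the socket, no pickling."""
    buffers = [memoryview(col).cast("B") for col in columns]
    while buffers:
        _advance(buffers, sock.sendmsg(buffers))


def recv_columns(sock, columns):
    """Scatter-read: fill preallocated column buffers in place."""
    buffers = [memoryview(col).cast("B") for col in columns]
    while buffers:
        nbytes = sock.recvmsg_into(buffers)[0]
        if nbytes == 0:
            raise ConnectionError("peer closed before all columns arrived")
        _advance(buffers, nbytes)


# Two small "columns" held as contiguous typed buffers.
ids = array("q", [1, 2, 3, 4])
prices = array("d", [9.5, 10.0, 10.5, 11.0])

# A local socket pair stands in for a remote connection.
left, right = socket.socketpair()
send_columns(left, [ids, prices])

# The receiver preallocates buffers of the agreed type and length.
ids_out = array("q", [0] * len(ids))
prices_out = array("d", [0.0] * len(prices))
recv_columns(right, [ids_out, prices_out])

left.close()
right.close()
```

The bytes on the wire are the in-memory layout of each column, so neither side converts values one by one. Both ends must agree on the layout, which is the role a shared format such as Arrow plays.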

Session hashtag: #SFdev3

Julien Le Dem, Architect at Dremio

About Julien

Julien Le Dem is the coauthor of Apache Parquet and the PMC chair of the project. He is also a committer and PMC Member on Apache Arrow. Julien is an architect at Dremio and was previously the tech lead for Twitter’s data processing tools, where he also obtained a two-character Twitter handle (@J_). Prior to Twitter, Julien was a principal engineer and tech lead working on content platforms at Yahoo, where he received his Hadoop initiation. His French accent makes his talks particularly attractive.

Li Jin, Distributed System Engineer at Two Sigma Investments, LP

About Li

Li Jin is a distributed system engineer at Two Sigma. Li focuses on building high-performance data analysis tools with Spark. Li is a co-creator of Flint, a time series analysis library on Spark. Previously, Li worked on building a large-scale task scheduling system. In his spare time, Li loves hiking, traveling and winter sports.