Spark Summit 2013 brought the Apache Spark community together on December 2-3, 2013 at the Hotel Nikko in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
This talk will cover performance instrumentation of Spark jobs. I'll begin with a high-level overview of Spark's execution engine and a review of how Spark jobs are compiled into lower-level execution plans. I'll then show how to use recently introduced instrumentation to diagnose performance bottlenecks in Spark jobs, and cover common causes of bottlenecks such as data skew, garbage collection, shuffle overhead, and poorly chosen degrees of parallelism. For each cause, I'll explain why it leads to poor performance and how to mitigate its effects. This talk is targeted primarily at people already using Spark who want to attain a principled understanding of its operation.
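As a small illustration of one of the mitigations mentioned above (a sketch, not taken from the talk itself): a poorly chosen degree of parallelism on the reduce side of a shuffle can often be fixed by passing an explicit partition count. The input path and partition count below are hypothetical.

```scala
// Hypothetical sketch for the Spark shell, where `sc` is the SparkContext.
// A word count whose shuffle defaults to too few reduce partitions can be
// given an explicit degree of parallelism:
val lines = sc.textFile("hdfs://.../input")   // hypothetical input path
val counts = lines
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 200)   // 200 reduce partitions instead of the default
```

Alternatively, setting the `spark.default.parallelism` configuration property changes the default partition count used by shuffle operations across the whole job.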