Spark Summit 2014 brought the Apache Spark community together on June 30- July 2, 2014 at the The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
This talk will present a technical “”deep-dive”” into Spark that focuses on its internal architecture. The content will be geared towards those already familiar with the basic Spark API who want to gain a deeper understanding of how it works and become advanced users or Spark developers.
This talk will walk through the major internal components of Spark:
The RDD data model, the scheduling subsystem, and Spark’s internal block-store service. For each component we’ll describe its
architecture and role in job execution. We’ll also provide examples of how higher level libraries like SparkSQL and MLLib interact with the core Spark API. Throughout the talk we’ll cover advanced topics like data serialization, RDD partitioning, and user-defined RDD’s, with a focus on actionable advice that users can apply to their own workloads.
Aaron Davidson is an Apache Spark committer and software engineer at Databricks. He has implemented Spark standalone cluster fault tolerance and shuffle file consolidation, and has helped in the design, implementation, and testing of Spark`s external sorting and driver fault tolerance.