San Francisco
June 30 - July 2, 2014


Spark Summit 2015
Advanced Apache Spark


Instructor: Sameer Farooqui, Client Services Engineer at Databricks

A small sample of the slides for this class can be found here, licensed under a Creative Commons license. (Note that the actual slides used in class are subject to change.)

Have you opened the Spark shell, run some transformations and actions, and now want to take your Spark knowledge to the next level? This advanced course is for students who already have some beginner-level familiarity with Spark’s architecture and developer API (or who have gone through the prerequisites listed below). Students can expect a fast-paced, vendor-agnostic, and very technical class on Spark, split roughly 80% lecture and 20% hands-on labs. All students will get a temporary account on Databricks Cloud to try out the two Spark Core DevOps labs (one before lunch, one after), and additional labs will be given for students to complete after class.

The focus of the class is understanding the deep architecture of Spark Core, to give students a solid foundation for learning the higher-level libraries after class. There will also be some coverage of Spark SQL (~20 minutes) and Spark Streaming (~45 minutes). We will not be discussing MLlib or GraphX in this course at all.

After this class, you will be able to:

  • Explain how Spark runs differently on Hadoop versus Cassandra
  • Understand how a Spark application breaks down into: Jobs -> Stages -> Tasks
  • Write more performant Spark code by chaining transformations in intelligent ways, using custom serialization, and tweaking Spark core settings
  • Explain the pros + cons of writing Spark code in Scala vs. Python vs. Java
  • Explain how a developer’s Spark application code executes from an operational perspective in a large cluster (DevOps)

Class outline:

    1. Spark Architecture on various Resource Managers
      • Resource Managers covered: Local mode, Standalone Scheduler, YARN, Mesos
      • How these JVMs interact with each other: Driver, Executor, Worker, Spark Master
      • Using Spark Submit to submit a Spark application to any of the resource managers
      • Considerations for how much RAM to assign each JVM
      • How to think about setting the number of task slots in each Executor based on different workload requirements
      • How to deal with Out of Memory errors in Spark (& identifying if the OOM is occurring in the Driver or Executor)
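To make the knobs above concrete, here is an illustrative Spark Submit invocation against a Standalone master. The master URL, memory sizes, core count, class name, and jar path are placeholders for illustration, not recommendations:

```shell
# Illustrative only: master URL, sizes, class name, and jar are placeholders.
# --master selects the resource manager covered in this section:
#   local[*]             Local mode (Driver and Executor threads in one JVM)
#   spark://host:7077    Standalone Scheduler
#   yarn-client / yarn-cluster    YARN
#   mesos://host:5050    Mesos
spark-submit \
  --master spark://master-host:7077 \
  --driver-memory 2g \
  --executor-memory 8g \
  --total-executor-cores 16 \
  --class com.example.MyApp \
  my-app.jar
```

Note that `--driver-memory` and `--executor-memory` size the Driver and Executor JVM heaps discussed above, which is where Out of Memory errors surface on one side or the other.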
    2. Resilient Distributed Datasets
      • Understanding lineage: Narrow vs. Wide dependencies
      • Demystifying how a Spark application breaks down into Jobs -> Stages -> Tasks
      • How to choose between the 7 different persistence levels for caching RDDs
      • Becoming aware of the ~20 types of RDDs (HadoopRDD, MappedRDD, FilteredRDD, CassandraRDD, SchemaRDD, etc)
      • Using the “Preserves Partitioning” parameter to make shuffles in Spark run faster
      • When and why to use Broadcast Variables and Accumulators
      • Architecture of how the new BitTorrent implementation of Broadcast Variables works
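As a mental model for the lineage and dependency topics above, the sketch below is a toy in plain Python, not Spark’s API: transformations only record lineage, an action walks that lineage back to the source, and a wide (shuffle) dependency is where a stage boundary would fall in real Spark.

```python
# Toy model of RDD lineage in plain Python -- NOT Spark's API, just an
# illustration of lazy transformations and narrow vs. wide dependencies.
class ToyRDD:
    def __init__(self, data=None, parent=None, op=None, wide=False):
        self._data, self._parent, self._op, self.wide = data, parent, op, wide

    # Narrow transformations: each output partition depends on one parent partition.
    def map(self, f):
        return ToyRDD(parent=self, op=lambda rows: [f(r) for r in rows])

    def filter(self, p):
        return ToyRDD(parent=self, op=lambda rows: [r for r in rows if p(r)])

    # Wide transformation: output depends on ALL parent partitions (a shuffle,
    # hence a stage boundary in real Spark).
    def group_by_key(self):
        def op(rows):
            out = {}
            for k, v in rows:
                out.setdefault(k, []).append(v)
            return sorted(out.items())
        return ToyRDD(parent=self, op=op, wide=True)

    # Action: nothing runs until here; walk the lineage, then compute forward.
    def collect(self):
        if self._parent is None:
            return list(self._data)
        return self._op(self._parent.collect())

pairs = ToyRDD([("a", 1), ("b", 2), ("a", 3)])
grouped = pairs.map(lambda kv: (kv[0], kv[1] * 10)).group_by_key()
print(grouped.collect())   # [('a', [10, 30]), ('b', [20])]
```

The `wide` flag marks where a real scheduler would cut a Job into Stages; everything between two wide dependencies pipelines into the Tasks of one Stage.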
    3. Use case: How the 100-TB sort competition was won
      • Architecture of the 2 shuffle implementations in Spark 1.2: HASH and SORT
      • Details of the EC2 hardware and software settings used for winning the 100-TB sort benchmark in 2014
      • External Shuffle Service
      • Netty Native Transport (aka zero-copy)
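The essential contrast between the two shuffle implementations can be sketched in plain Python (this is an illustration of the idea, not Spark code): hash shuffle writes one bucket per reduce partition, producing many small files per map task, while sort shuffle writes a single output ordered by target partition plus an index, which is the layout that made the petabyte-scale sort practical.

```python
# Toy contrast of the HASH and SORT shuffle-write strategies (illustrative only).
# Integer keys are used so hashing is deterministic across runs.
def partition_of(key, num_reducers):
    return hash(key) % num_reducers

def hash_shuffle_write(records, num_reducers):
    # One in-memory bucket per reduce partition; conceptually this becomes
    # num_reducers separate files per map task.
    buckets = {p: [] for p in range(num_reducers)}
    for k, v in records:
        buckets[partition_of(k, num_reducers)].append((k, v))
    return buckets

def sort_shuffle_write(records, num_reducers):
    # Single output sorted by partition id; conceptually one data file plus
    # one index file per map task, from which each reducer fetches its slice.
    tagged = ((partition_of(k, num_reducers), k, v) for k, v in records)
    return sorted(tagged, key=lambda t: t[0])

records = [(5, "a"), (2, "b"), (4, "c")]
print(hash_shuffle_write(records, 2))   # {0: [(2, 'b'), (4, 'c')], 1: [(5, 'a')]}
print(sort_shuffle_write(records, 2))   # [(0, 2, 'b'), (0, 4, 'c'), (1, 5, 'a')]
```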
    4. PySpark
      • Architecture of how Python user code gets shipped via the Driver and Executor JVMs to actual Python PIDs running in the distributed cluster
      • Explanation of how user code runs in Python PIDs, but MLlib/SQL/Shuffle runs directly in Executor JVMs
      • When and why to choose different implementations of Python (CPython vs. PyPy)
      • Performance considerations when developing in PySpark
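One of those performance considerations is that data crossing the Executor JVM / Python worker boundary travels as pickled byte streams. This plain-Python sketch (row contents are made up for illustration) shows why records are serialized in batches: per-pickle framing overhead is paid once per batch rather than once per row.

```python
import pickle

# Hypothetical rows standing in for data shipped to a Python worker.
rows = [(i, "user-%d" % i) for i in range(1000)]

one_batch = len(pickle.dumps(rows))                   # one pickle, 1000 rows
row_by_row = sum(len(pickle.dumps(r)) for r in rows)  # 1000 separate pickles

# Batching amortizes the fixed per-pickle overhead across many rows.
print(one_batch, row_by_row)
assert one_batch < row_by_row
```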
    5. Spark SQL

Note: This will be a relatively short section

      • Leveraging SchemaRDDs and DataFrames
      • Comparing the architecture of Apache Hive vs. Spark SQL
      • Using Parquet Files
    6. Spark Streaming

Note: This will also be a relatively short section

      • Understanding the Streaming Architecture: How DStreams break down into RDD batches
      • How receivers run inside Executor task slots to capture data coming in from a network socket, Kafka, or Flume
      • Increasing throughput performance in Spark Streaming via multiple receivers and the Union transformation
      • Sliding window operations on DStreams
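The windowing idea above can be sketched as a toy in plain Python (not the Spark API): a DStream is a sequence of per-batch RDDs, and a window of length w sliding by s unions the last w batches every s batches.

```python
# Toy model of a DStream sliding window (illustrative, not Spark code).
def sliding_windows(batches, window_len, slide):
    for end in range(window_len, len(batches) + 1, slide):
        window = batches[end - window_len:end]
        yield [x for batch in window for x in batch]   # union of the batches

# Four per-interval "batches" of events; window of 2 batches, sliding by 1.
batches = [[1, 2], [3], [4, 5], [6]]
for w in sliding_windows(batches, window_len=2, slide=1):
    print(sum(w))   # a windowed aggregation; prints 6, 12, 15
```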

Also, remember to bring your laptop to class. There is no need to have Spark, Java, or Python pre-installed on your laptop; you just need Chrome or Firefox to access Databricks Cloud.

All students are strongly advised to complete the following prerequisites BEFORE class starts:

  • 1 hour: Watch this introductory video on how Spark works:
  • 2 hours: Read the Spark RDD white paper
  • 30 mins: Skim through the official Spark 1.2 documentation. Click on the links in the top navigation bar under Programming Guides, API Docs, Deploying, and More to get familiar with the breadth of topics. You don’t need to read the entire documentation, but you should be familiar with what the different sections are.
  • 1 hour: Install Spark (either via an Apache download or via CDH/HDP/MapR/DSE) and, using the Python or Scala shell, run through some basic transformations (map, filter, distinct, etc.) and actions (collect, count, reduce, first, saveAsTextFile, etc.).

If you plan on using Spark against Hadoop or Cassandra, one of the following two documents will get you started: Spark + Hadoop Spark + Cassandra
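As a warm-up before opening the real shell, the named transformations and actions have close analogues in plain Python builtins. This is NOT PySpark, just a way to rehearse the vocabulary:

```python
from functools import reduce

data = [3, 1, 2, 3, 4]

# Transformations (in Spark these would be lazy; here they run eagerly):
mapped   = [x * 2 for x in data]        # rdd.map(lambda x: x * 2)
filtered = [x for x in data if x > 1]   # rdd.filter(lambda x: x > 1)
distinct = sorted(set(data))            # rdd.distinct()

# Actions (in Spark these trigger the actual computation):
print(len(data))                         # rdd.count()  -> 5
print(data[0])                           # rdd.first()  -> 3
print(reduce(lambda a, b: a + b, data))  # rdd.reduce(lambda a, b: a + b) -> 13
```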