Databricks Spark Training 2015

Databricks Spark training is offered as part of the 3-day pass to the Spark Summit and comprises three tracks: introductory, advanced DevOps, and data science. All sessions begin at 9am on June 17, 2015 and will end by 5pm. Lunch will be provided.

Labs will be completed on Databricks for all 3 classes. Students will get a free limited account on Databricks for the training class.

Trainings offered at Spark Summit 2015:

What’s required for the tutorials?

  • Bring your own Wi-Fi-enabled laptop with Chrome or Firefox installed.




Instructor: Brian Clapper

The Introduction to Apache Spark workshop teaches developers the core Spark APIs in depth. This full-day course features hands-on technical exercises to get up to speed using Spark for data exploration, analysis, and building Big Data applications.

Topics covered include:

  • Overview of Big Data and Spark
  • Installing Spark Locally
  • Using Spark’s Core APIs in Scala, Java, and Python
  • Building Spark Applications
  • Deploying on a Big Data Cluster
  • Combining SQL, Machine Learning, and Streaming for Unified Pipelines

Prerequisites: some experience coding in Python, Java, or Scala, plus some familiarity with Big Data issues and concepts.




Instructor: Reza Zadeh

This class is for developers to learn how to combine the scalability of Spark with machine learning and graph processing. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on building and using machine learning at scale via MLlib and GraphX. Coding examples based on public data sets will be provided, leveraging cloud-based notebooks. Includes limited free accounts on Databricks.

Topics covered include:

  • Building scalable Machine Learning algorithms on Spark, discussing design decisions inside MLlib and GraphX
  • Predictive analytics based on MLlib and building classifiers with a variety of algorithms
  • How primitives like matrix factorization are implemented in a distributed framework, explained by the designers of MLlib
  • Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and RecSys Challenge 2015, including take-home exercises
  • The class will be based on a graduate-level class taught at Stanford: CME 323

Prerequisites:


  • Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
  • Experience coding in Scala, Python, SQL
  • Familiarity with Data Science and Machine Learning




Instructor: Sameer Farooqui

A small sample of the slides for this class can be found here, licensed under a Creative Commons license.

(note that actual slides used in class are subject to change)

Have you opened the Spark shell, run some transformations and actions, and now want to take your Spark knowledge to the next level? This advanced course is for students who already have some beginner-level familiarity with Spark’s architecture and developer API (or have gone through the prerequisites listed below). Students attending can expect a fast-paced, vendor-agnostic, and very technical class on Spark.

This class will be about 80% lecture and 20% hands-on labs. All students will get a temporary account on Databricks to try out the 2 Spark Core DevOps labs (one before lunch, one after). Additional labs will be provided for students to complete after class.

The focus of the class will be understanding the deep architecture of Spark Core to give students a solid foundation for learning the higher-level libraries after class. However, there will be some coverage of Spark SQL (~20 minutes) and Spark Streaming (~45 minutes) as well. We will not be discussing MLlib or GraphX in this course at all.


After this class, you will be able to:

  • Explain how Spark runs differently on Hadoop versus Cassandra
  • Understand how a Spark application breaks down into: Jobs -> Stages -> Tasks
  • Write more performant Spark code by chaining transformations in intelligent ways, using custom serialization, and tweaking Spark core settings
  • Explain the pros and cons of writing Spark code in Scala vs. Python vs. Java
  • Explain how a developer’s Spark application code executes from an operational perspective in a large cluster (DevOps)


Class outline:

  1. Spark Architecture on various Resource Managers
    • Resource Managers covered: Local mode, Standalone Scheduler, YARN, Mesos
    • How these JVMs interact with each other: Driver, Executor, Worker, Spark Master
    • Using Spark Submit to submit a Spark application to any of the resource managers
    • Considerations for how much RAM to assign each JVM
    • How to think about setting the number of task slots in each Executor based on different workload requirements
    • How to deal with Out of Memory errors in Spark (and identifying whether the OOM is occurring in the Driver or an Executor)


  2. Resilient Distributed Datasets
    • Understanding lineage: Narrow vs. Wide dependencies
    • Demystifying how a Spark application breaks down into Jobs -> Stages -> Tasks
    • How to choose between the 7 different persistence levels for caching RDDs
    • Becoming aware of the ~20 types of RDDs (HadoopRDD, MappedRDD, FilteredRDD, CassandraRDD, SchemaRDD, etc.)
    • Using the “Preserves Partitioning” parameter to make shuffles in Spark run faster
    • When and why to use Broadcast Variables and Accumulators
    • Architecture of how the new BitTorrent implementation of Broadcast Variables works
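
As a rough preview of the Jobs -> Stages -> Tasks breakdown listed above, the following plain-Python sketch splits a linear chain of operations into stages at each wide (shuffle) dependency. It is a toy model under stated assumptions, not Spark's scheduler: the `WIDE_OPS` set, the `split_into_stages` and `count_tasks` helpers, and the fixed partition count are all hypothetical names and simplifications introduced for illustration.

```python
# Toy model only: Spark's real scheduler walks the RDD dependency graph,
# not a flat list of operation names. All names here are illustrative.
WIDE_OPS = {"groupByKey", "reduceByKey", "distinct", "repartition"}


def split_into_stages(ops):
    """Group a linear chain of operations into stages.

    A new stage starts whenever a wide (shuffle) operation is reached,
    because its input must be redistributed across partitions first.
    """
    stages = [[]]
    for op in ops:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])  # shuffle boundary
        stages[-1].append(op)
    return stages


def count_tasks(stages, num_partitions):
    """Assume one task per partition per stage (a simplification:
    the partition count can change after a shuffle)."""
    return len(stages) * num_partitions


ops = ["flatMap", "map", "reduceByKey", "filter", "distinct", "map"]
stages = split_into_stages(ops)
tasks = count_tasks(stages, num_partitions=4)

for i, stage in enumerate(stages):
    print(f"stage {i}: {stage}")
print("tasks:", tasks)
```

Narrow operations such as `map` and `filter` stay in the same stage because each output partition depends on a single input partition, which is why chaining them is cheap compared with adding another shuffle.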


  3. Use case: How the 100-TB sort competition was won
    • Architecture of the 2 shuffle implementations in Spark 1.2: HASH and SORT
    • Details of the EC2 hardware and software settings used for winning the 100-TB sort benchmark in 2014
    • External Shuffle Service
    • Netty Native Transport (aka zero-copy)


  4. PySpark
    • Architecture of how Python user code gets shipped via the Driver and Executor JVMs to actual Python PIDs running in the distributed cluster
    • Explanation of how user code runs in Python PIDs, but MLlib/SQL/Shuffle runs directly in Executor JVMs
    • When and why to choose different implementations of Python (CPython vs. PyPy)
    • Performance considerations when developing in PySpark


  5. Spark SQL (Note: This will be a relatively short section)
    • Leveraging SchemaRDDs and DataFrames
    • Comparing the architecture of Apache Hive vs. Spark SQL
    • Using Parquet Files


  6. Spark Streaming (Note: This will also be a relatively short section)
    • Understanding the Streaming Architecture: How DStreams break down into RDD batches
    • How receivers run inside Executor task slots to capture data coming in from a network socket, Kafka, or Flume
    • Increasing throughput performance in Spark Streaming via multiple receivers and the Union transformation
    • Sliding window operations on DStreams
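
To make the idea of DStreams breaking down into batches and sliding windows more concrete, here is a minimal plain-Python sketch. It is an analogy rather than Spark Streaming code: each inner list stands in for one batch of records, and the `sliding_windows` helper with its `window_length` and `slide_interval` parameters (both counted in batches) is a hypothetical name chosen for illustration. It emits a partial window at the start and makes no claim to match Spark's exact behavior.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Yield the combined records of the most recent `window_length`
    batches, once every `slide_interval` batches."""
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        # Flatten the batches that fall inside the current window.
        yield [record for batch in batches[start:end] for record in batch]


# Hypothetical stream: six batches of integer records.
batches = [[1, 2], [3], [4, 5], [6], [7, 8], [9]]

windows = list(sliding_windows(batches, window_length=3, slide_interval=2))
window_sums = [sum(w) for w in windows]

print(windows)
print(window_sums)
```

Because the window (3 batches) is longer than the slide (2 batches), consecutive windows overlap by one batch, so some records are processed more than once.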


All students are strongly advised to complete the following prerequisites BEFORE class starts:

1 hour: Watch this introductory video on how Spark works.

30 mins: Skim through the official Spark 1.2 documentation. Click on the links in the top navigation bar under Programming Guides, API docs, Deploying and More to get familiar with the breadth of topics. You don’t need to read the entire documentation here, but you should be familiar with what the different sections are.

1 hour: Install Spark (either via an Apache download or using CDH/HDP/MapR/DSE) and, using the Python or Scala shell, run through some basic transformations (map, filter, distinct, etc.) and actions (collect, count, reduce, first, saveAsTextFile, etc.).
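
Before opening a Spark shell, the difference between transformations and actions can be previewed in plain Python. The sketch below is not Spark code: it uses Python's built-in lazy iterators as a stand-in for RDD transformations, and the variable names and sample data are hypothetical. The comments point to the Spark operation each line loosely resembles.

```python
from functools import reduce

# Hypothetical input data, standing in for the contents of an RDD.
words = ["spark", "summit", "spark", "databricks", "rdd", "summit"]

# "Transformations": lazy in Python 3, so nothing is computed yet.
lengths = map(len, words)                         # loosely like map
long_lengths = filter(lambda n: n > 3, lengths)   # loosely like filter

# "Actions": force evaluation and return concrete results.
collected = list(long_lengths)                    # loosely like collect
count = len(collected)                            # loosely like count
total = reduce(lambda a, b: a + b, collected)     # loosely like reduce
first = collected[0]                              # loosely like first
distinct_words = sorted(set(words))               # loosely like distinct, then collect

print(collected, count, total, first)
print(distinct_words)
```

As in Spark, the `map` and `filter` steps only describe work to be done; the data is actually processed when a result is requested.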

If you plan on using Spark against Hadoop or Cassandra, one of the following two documents will get you started:

  • Spark + Hadoop
  • Spark + Cassandra

Also, remember to bring your laptop to class. You do not need Spark, Java, or Python pre-installed on your laptop; you just need Chrome or Firefox to access Databricks.