San Francisco
June 30 - July 2, 2014
Databricks Spark Training 2015

Databricks Spark training is offered as part of the 2-day pass to the Spark Summit and comprises three tracks: introductory, advanced DevOps, and data science. All sessions begin at 9am on March 19, 2015 and end by 5pm. Lunch will be provided. Workshops will run on Databricks Cloud.

What’s required for the tutorials?

  • Bring your own laptop
  • Wifi-enabled laptop with browsing capability
  • Reasonably current hardware (+2GB)
  • Mac OS X, Windows, or Linux, installed and functioning
  • Temporarily remove any corporate security controls that prevent use of the conference network
  • Have Java JDK 6/7/8 installed
  • Have Python 2.7 installed

Note: Do not install Spark with Homebrew or Cygwin

Intro to Apache Spark

Watch the video HERE.

The Introduction to Apache Spark workshop is for developers to learn the core Spark APIs in-depth. This full-day course features hands-on technical exercises to get up to speed using Spark for data exploration, analysis, and building Big Data applications.
Topics covered include:

  • Overview of Big Data and Spark
  • Installing Spark Locally
  • Using Spark’s Core APIs in Scala, Java, Python
  • Building Spark Applications
  • Deploying on a Big Data Cluster
  • Combining SQL, Machine Learning, and Streaming for Unified Pipelines

Prerequisites: Some experience coding in Python, Java, or Scala, plus some familiarity with Big Data issues/concepts.


Watch the video HERE.

Data Science applications with Apache Spark combine the scalability of Spark with distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud.
Topics covered include:

  • Data transformation techniques based on both Spark SQL and functional programming in Scala and Python.
  • Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms and text analytics – all with emphasis on an iterative cycle of feature engineering, modeling, evaluation.
  • Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
  • How primitives such as Matrix Factorization are implemented in a distributed parallel framework, as explained by the designers of MLlib.
  • Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and RecSys Challenge 2015.
Prerequisites:

    • Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
    • Experience coding in Scala, Python, SQL
    • Have some familiarity with Data Science topics (e.g., business use cases)


    Watch the video HERE.

    Instructor: Sameer Farooqui, Client Services Engineer at Databricks

    The slides for this class can be found here, licensed under a Creative Commons license.

    (note that actual slides used in class are subject to change)

    Have you opened the Spark shell, run some transformations/actions, and now want to take your Spark knowledge to the next level? This advanced course is for students who already have some beginner-level familiarity with Spark’s architecture and developer API (or have gone through the prerequisites listed below). Students attending can expect a fast-paced, vendor-agnostic, and very technical class on Spark.

    This class will be about 80% lecture and 20% hands-on labs. All students will get a temporary account on Databricks Cloud to try out the 2 Spark Core DevOps labs (one before lunch, one after). Additional labs will be given for students to complete after class.

    The focus of the class will be understanding the deep architecture of Spark Core, to give students a solid foundation for learning the higher-level libraries after class. However, there will be some coverage of Spark SQL (~20 minutes) and Spark Streaming (~45 minutes) as well. We will not be discussing MLlib or GraphX in this course at all.

    After this class, you will be able to:

    • Explain how Spark runs differently on Hadoop versus Cassandra
    • Understand how a Spark application breaks down into: Jobs -> Stages -> Tasks
    • Write more performant Spark code by chaining transformations in intelligent ways, using custom serialization, and tweaking Spark core settings
    • Explain the pros + cons of writing Spark code in Scala vs. Python vs. Java
    • Explain how a developer’s Spark application code executes from an operational perspective in a large cluster (DevOps)
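The Jobs -> Stages -> Tasks breakdown listed above can be pictured with a small plain-Python model. This is only a sketch under simplifying assumptions, not Spark's scheduler: it treats a job as a linear chain of transformation names, starts a new stage at every wide (shuffle) dependency, and assumes one task per partition per stage with a fixed partition count. The names `WIDE_OPS`, `split_into_stages`, and `count_tasks` are hypothetical helpers invented for this illustration.

```python
# Simplified model of how one job is cut into stages and tasks.
# WIDE_OPS is an illustrative subset of shuffle-producing operations,
# chosen for this sketch; it is not an exhaustive list.
WIDE_OPS = {"reduceByKey", "groupByKey", "sortByKey", "join", "distinct"}


def split_into_stages(ops):
    """Group transformation names into stages; a wide op starts a new stage."""
    stages = [[]]
    for op in ops:
        # A shuffle boundary closes the current stage (if it has any work).
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def count_tasks(stages, num_partitions):
    """Assume one task per partition per stage, with a fixed partition count."""
    return len(stages) * num_partitions


# One action triggers one job; this hypothetical job chains four transformations.
job = ["map", "filter", "reduceByKey", "map"]
stages = split_into_stages(job)
print(stages)
print(count_tasks(stages, num_partitions=4))
```

In this model the narrow operations (`map`, `filter`) are pipelined into the same stage, which is why chaining them costs no extra shuffle.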

    Class outline:

    1. Spark Architecture on various Resource Managers
      • Resource Managers covered: Local mode, Standalone Scheduler, YARN, Mesos
      • How these JVMs interact with each other: Driver, Executor, Worker, Spark Master
      • Using Spark Submit to submit a Spark application to any of the resource managers
      • Considerations for how much RAM to assign each JVM
      • How to think about setting the number of task slots in each Executor based on different workload requirements
      • How to deal with Out of Memory errors in Spark (& identifying if the OOM is occurring in the Driver or Executor)
    2. Resilient Distributed Datasets
      • Understanding lineage: Narrow vs. Wide dependencies
      • Demystifying how a Spark application breaks down into Jobs -> Stages -> Tasks
      • How to choose between the 7 different persistence levels for caching RDDs
      • Becoming aware of the ~20 types of RDDs (HadoopRDD, MappedRDD, FilteredRDD, CassandraRDD, SchemaRDD, etc)
      • Using the “Preserves Partitioning” parameter to make shuffles in Spark run faster
      • When and why to use Broadcast Variables and Accumulators
      • Architecture of how the new BitTorrent implementation of Broadcast Variables works
    3. Use case: How the 100-TB sort competition was won
      • Architecture of the 2 shuffle implementations in Spark 1.2: HASH and SORT
      • Details of the EC2 hardware and software settings used for winning the 100-TB sort benchmark in 2014
      • External Shuffle Service
      • Netty Native Transport (aka zero-copy)
    4. PySpark
      • Architecture of how Python user code gets shipped via the Driver and Executor JVMs to actual Python PIDs running in the distributed cluster
      • Explanation of how user code runs in Python PIDs, but MLlib/SQL/Shuffle runs directly in Executor JVMs
      • When and why to choose different implementations of Python (CPython vs. PyPy)
      • Performance considerations when developing in PySpark
    5. Spark SQL

    Note: This will be a relatively short section

      • Leveraging SchemaRDDs and DataFrames
      • Comparing the architecture of Apache Hive vs. Spark SQL
      • Using Parquet Files
    6. Spark Streaming

    Note: This will also be a relatively short section

      • Understanding the Streaming Architecture: How DStreams break down into RDD batches
      • How receivers run inside Executor task slots to capture data coming in from a network socket, Kafka or Flume
      • Increasing throughput performance in Spark Streaming via multiple receivers and the Union transformation
      • Sliding window operations on DStreams
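To make the last two Streaming topics concrete, here is a minimal plain-Python sketch of a sliding window over a stream that has already been cut into batches. It is an illustration only and does not use the Spark Streaming API: the helper `sliding_windows` is hypothetical, batches are ordinary lists standing in for per-interval RDDs, and the window length and slide interval are counted in batches rather than seconds.

```python
def sliding_windows(batches, window_len, slide):
    """Yield the merged contents of each window over a list of batches.

    window_len: how many of the most recent batches a window covers.
    slide: how many batches pass between two consecutive windows.
    """
    for end in range(slide, len(batches) + 1, slide):
        start = max(0, end - window_len)  # early windows may be partially filled
        merged = []
        for batch in batches[start:end]:
            merged.extend(batch)
        yield merged


# Four batches of made-up records, one list per batch interval.
batches = [[1, 2], [3], [4, 5], [6]]
windows = list(sliding_windows(batches, window_len=3, slide=2))
print(windows)
```

Because the window is longer than the slide, consecutive windows overlap, so a record can be counted in more than one window.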

    Also, remember to bring your laptop to class. There is no need to have Spark pre-installed or Java/Python installed on your laptop. You just need Chrome or Firefox to access Databricks Cloud.

    All students are strongly advised to complete the following prerequisites BEFORE class starts:
    1 hour: Watch this introductory video on how Spark works:

    2 hours: Read the Spark RDD white paper

    30 mins: Skim through the official Spark 1.2 documentation. Click on the links in the top navigation bar under Programming Guides, API docs, Deploying and More to get familiar with the breadth of topics. You don’t need to read the entire documentation here, but you should be familiar with what the different sections are.

    1 hour: Install Spark (either via an Apache download or using CDH/HDP/MapR/DSE) and, using the Python or Scala shell, run through some basic transformations (map, filter, distinct, etc.) and actions (collect, count, reduce, first, saveAsTextFile, etc.).
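As a warm-up before opening the Spark shell, the sketch below runs the same kinds of operations on an ordinary Python list, so it needs no Spark install. It only mirrors what each operation computes: the sample data is made up, and the comments naming RDD methods refer to the operations listed above rather than to anything this snippet calls. One difference to keep in mind is that Spark transformations are lazy and only execute when an action runs.

```python
from functools import reduce

# Made-up sample data standing in for the contents of an RDD.
words = ["spark", "summit", "spark", "training", "sf"]

# Transformations: each derives a new collection from an existing one.
lengths = list(map(len, words))                          # compare: map
long_words = list(filter(lambda w: len(w) > 2, words))   # compare: filter
unique = sorted(set(words))                              # compare: distinct (sorted here for a stable order)

# Actions: each returns a plain value to the caller.
count = len(words)                                       # compare: count
total_chars = reduce(lambda a, b: a + b, lengths)        # compare: reduce
first_word = words[0]                                    # compare: first

print(lengths, long_words, unique)
print(count, total_chars, first_word)
```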

    If you plan on using Spark against Hadoop or Cassandra, one of the following two documents will get you started:
    Spark + Hadoop
    Spark + Cassandra