Open Source Apache Spark Training by Databricks

Apache Spark training is offered as part of the 3-day pass to Spark Summit and contains an introductory course, a data science track, and an advanced track. All sessions are hands-on learning opportunities that begin at 9am on February 16, 2016 and end by 5pm. Lunch will be provided. Workshops will run on Databricks.

Labs for all three classes will be completed on Databricks, and students will receive a free Databricks account for the training class.

Training courses offered at Spark Summit East 2016:

What’s required for the tutorials?

  • Bring your own Wi-Fi-enabled laptop with Chrome or Firefox installed.

Apache Spark Essentials will help you get productive with the core capabilities of Spark, as well as provide an overview and examples for some of Spark’s more advanced features. This full-day course features hands-on technical exercises so that you can become comfortable applying Spark to your datasets. In this class, you will get hands-on experience with ETL, exploration, and analysis using real world data.


Prerequisites: some experience in Scala and Python. Some familiarity with big data or parallel processing concepts is helpful.

Topics covered include:

  • Overview: Big Data and Spark
  • Parallel Processing with Resilient Distributed Datasets (RDDs)
  • Transformations and Actions on Data using RDDs
  • Structured Data Processing with DataFrames
  • DataFrames and Spark SQL for Easier, Faster, Scalable Computing
  • Spark Architecture and Cluster Deployment
  • Memory and Performance
  • Overview: Machine Learning and Streaming Data Sources
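
A central idea in the RDD topics above is that transformations are lazy while actions trigger computation. As a rough, plain-Python analogy (this is not the PySpark API; generators stand in for lazy RDD transformations), a word-count pipeline might look like:

```python
# Conceptual, plain-Python analogy of Spark's lazy transformations vs
# eager actions: "transformations" build a lazy pipeline (generators),
# and nothing runs until an "action" forces evaluation.

lines = ["spark makes big data simple", "big data at scale"]

# "Transformations" -- lazy, nothing is computed yet:
words = (w for line in lines for w in line.split())   # like flatMap()
long_words = (w for w in words if len(w) > 3)         # like filter()
pairs = ((w, 1) for w in long_words)                  # like map()

# "Action" -- forces evaluation, like reduceByKey() + collect():
counts = {}
for word, n in pairs:
    counts[word] = counts.get(word, 0) + n

print(counts)  # {'spark': 1, 'makes': 1, 'data': 2, 'simple': 1, 'scale': 1}
```

In Spark itself the pipeline is additionally partitioned across a cluster and recomputed from lineage on failure, which is what makes the lazy model pay off.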

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in Spark ML and Spark MLlib. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using ML and MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark, although we will view Spark’s source code a couple of times.

We’ll work through examples using public datasets that show you how to apply Apache Spark to iterate faster and develop models on massive datasets. This workshop will give you the tools to be productive using Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the ML and MLlib APIs, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.


Prerequisites: some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark. Knowledge of RDDs and basic RDD transformations is assumed. No prior DataFrame, MLlib, or ML knowledge is required.

Brief conceptual reviews of data science techniques will be performed before the techniques are used. Labs and demos will be presented using PySpark, so knowledge of Python will help; however, labs will also be made available in Scala.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • ML Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics 
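
Among the feature extractors listed above, HashingTF relies on the "hashing trick": each term is hashed into a fixed-size vector of term counts, so no dictionary of all possible words is needed. A simplified plain-Python sketch of the idea (conceptual only; Spark's own implementation uses a different hash function and sparse vectors):

```python
# Simplified sketch of the hashing trick behind HashingTF-style feature
# extraction: hash each term into one of num_features buckets and count.
# Collisions are possible but rare when num_features is large.

def hashing_tf(words, num_features=16):
    vec = [0] * num_features
    for w in words:
        vec[hash(w) % num_features] += 1  # bucket chosen by the hash
    return vec

doc = "spark spark streaming".split()
tf = hashing_tf(doc)
print(sum(tf))  # 3 -- the total term count is preserved
```

In an ML Pipeline this term-frequency vector would then typically be rescaled by IDF before being fed to an estimator such as LogisticRegression.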

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of roughly 50% lecture and 50% hands-on labs and demos.


Prerequisites:

  • Have at least one month of hands-on experience working with Apache Spark
  • Understand the difference between the Driver and Executor JVMs
  • Know how the following transformations work: map(), flatMap(), filter(), sample(), distinct(), union(), join(), groupByKey(), reduceByKey(), repartition()
  • Know how the following actions work: collect(), count(), saveAsTextFile(), first(), take(), reduce()
  • Basic usage of DataFrames
  • Basic programming experience in an object-oriented or functional language (the class will mostly be taught in Scala)
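
Of the transformations listed above, reduceByKey() is the one most often confused with groupByKey(). Its per-key semantics can be sketched in plain Python (conceptual only; in Spark the reduction runs in parallel, with partial reduction on each partition before any shuffle, which is why it is usually preferred over groupByKey()):

```python
# Plain-Python sketch of what reduceByKey(func) computes on a pair "RDD":
# group values by key, then fold each group with the supplied function.
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    grouped = defaultdict(list)
    for k, v in pairs:
        grouped[k].append(v)
    return {k: reduce(func, vs) for k, vs in grouped.items()}

pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4)]
print(reduce_by_key(pairs, lambda x, y: x + y))  # {'a': 4, 'b': 6}
```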

Topics covered include:

  • Overview: Wikipedia and Spark
  • Analyze data using:
    • DataFrames + Spark SQL
    • RDDs
    • Spark Streaming
    • MLlib (Machine Learning)
    • GraphX
  • Leveraging knowledge of Spark’s Architecture for performance tuning and debugging
  • How and when to use advanced Spark features:
    • Accumulators
    • Broadcast variables
    • Memory persistence levels
    • Spark UI details
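
The contract behind two of the advanced features above can be illustrated with a plain-Python analogy (not PySpark code): a broadcast variable is a read-only value shipped once to every worker, while an accumulator is a write-only total that tasks add to and only the driver reads.

```python
# Conceptual analogy of broadcast variables and accumulators.
broadcast_lookup = {"ERROR": 1, "WARN": 0}  # read-only, shipped to each worker
error_count = 0                              # accumulator, read on the driver

def process_partition(lines):
    """Stand-in for one task processing one partition."""
    global error_count
    local = 0
    for line in lines:
        level = line.split(":")[0]
        local += broadcast_lookup.get(level, 0)  # tasks only read the broadcast
    error_count += local                          # tasks only add to the accumulator

for partition in [["ERROR: disk", "WARN: slow"], ["ERROR: net"]]:
    process_partition(partition)

print(error_count)  # 2
```

In real Spark the driver's lookup table would be wrapped with sc.broadcast() so it is shipped once per executor rather than once per task, and the counter would be an accumulator whose value is only reliable when read on the driver.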