Open Source Apache Spark Training from Databricks

Apache Spark training from Databricks is offered as part of the 3-day pass to the Spark Summit and includes access to either the data science course or the exploring public data with Spark course. All sessions are hands-on learning opportunities that begin at 9 AM on February 7, 2017, and end by 5 PM. Lunch will be provided. Workshops will run on Databricks Community Edition, and attendees will be given complimentary access.

Trainings offered at Spark Summit East 2017

What’s required for the tutorials?

Bring your own wifi-enabled laptop with Google Chrome or Firefox installed.



Data Science with Apache Spark 2.x (SOLD OUT)


The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop takes a pragmatic approach: it focuses on using Apache Spark for data analysis and building models with MLlib, and limits the time spent on machine learning theory and the internal workings of Spark. We will, however, look at Spark’s source code a couple of times.

We’ll work through examples using public datasets that show how Apache Spark can help you iterate faster and develop models on massive datasets. This workshop will provide you with the tools to be productive using Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the MLlib DataFrames API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.


Attendees should have some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark. Knowledge of RDDs and basic RDD transformations is assumed. No prior DataFrame or MLlib knowledge is required.

Each data science technique will be briefly reviewed at a conceptual level before it is used. Labs and demos will be presented using PySpark, so knowledge of Python will help; however, labs will also be made available in Scala.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics



Exploring Wikipedia with Apache Spark (SOLD OUT)


The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class, we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of about 50% lecture and 50% hands-on labs and demos.


  • Some familiarity with Apache Spark is helpful.
  • Basic usage of Spark DataFrames is helpful.
  • Basic programming experience in an object-oriented or functional language (the class will mostly be taught in Scala) is required.

Topics covered include:

  • Overview: Wikipedia and Spark
  • Analyze data using:
    • DataFrames + Spark SQL
    • RDDs
    • Spark Streaming
    • MLlib (Machine Learning)
    • GraphFrames
  • Leveraging knowledge of Spark’s architecture for performance tuning and debugging
  • How and when to use advanced Spark features:
    • Accumulators
    • Broadcast variables
    • Memory persistence levels
    • Spark UI details