Training courses offered at Spark Summit EU 2017

Spark Summit EU 2017 features a number of hands-on training workshops designed to help improve your Apache Spark skills. See below for the list of available courses.

Please note that training is offered as a standalone ticket. If you wish to attend any Spark Summit EU 2017 conference sessions or networking activities on October 25th or 26th, you must register for a Conference Only pass in addition to your Training pass.

Location and Schedule

All training courses will be held at the Convention Centre Dublin. Please visit the Venue page for more details, including hotel and travel discounts.
Training will take place from 9AM to 5PM on Tuesday, October 24th. The Training pass covers training and lunch on that day only.

Adding a Conference Pass to Your Training

To attend any of the Spark Summit EU 2017 conference sessions or networking activities on October 25th and 26th, including the welcome reception in the Expo Hall, you’ll need to register for a Conference Only pass in addition to your Training pass.

What’s Required for the Tutorials?

Bring your own WiFi-enabled laptop with Google Chrome or Firefox installed.



Understand and Apply Deep Learning with Keras, TensorFlow, and Apache Spark

Instructor: Adam Breindel

This Deep Learning workshop introduces the conceptual background as well as implementation for key architectures in neural network machine learning models. We will see how and why deep learning has become such an important and popular technology, and how it is similar to and different from other machine learning models as well as earlier attempts at neural networks.

We’ll see how deep learning models can be used to enhance your traditional business analytics, in addition to covering the famous cases like image recognition, language processing, and autonomous agents. Most of our models will be built with the Keras API/Library, but we’ll also take a look at “what’s under the hood” with TensorFlow. We won’t just hack demos, though: our goal is to develop an intuition for the key concepts and issues at play in deep learning.

The class will also feature a discussion about using Apache Spark for training and inference, and other deployment / operational concerns. Along the way, we’ll hopefully explain enough ideas and terminology that you’ll be comfortable going further with deep learning on your own!


Prerequisites

Familiarity with the basics of Python and with common ideas and techniques in machine learning / predictive analytics is expected. You should be familiar with classification vs. regression problems, supervised vs. unsupervised learning, the bias-variance tradeoff, and common evaluation metrics like RMSE, precision, and recall.

No prior deep learning knowledge, vector calculus, or Spark experience is required.

Topics covered include

  • Neural nets before and after the 2006 revolution
  • Perceptrons and Deep Feed-Forward Networks
  • Capturing information and choosing error functions
  • Convolutional Networks
  • How networks are trained and what can go wrong
  • Recurrent Networks
  • Generative Models
  • Reinforcement Learning
  • Deep Learning inference at scale with Apache Spark
  • Approaches to distributed training, including Spark



Data Science with Apache Spark 2.x

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in the Spark MLlib DataFrames API. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, focusing on using Apache Spark for data analysis and building models with MLlib while limiting the time spent on machine learning theory and the internal workings of Spark. We will, however, look at Spark’s source code a couple of times.

We’ll work through examples using public datasets that will show you how to apply Apache Spark to iterate faster and develop models on massive datasets. This workshop will give you the tools to be productive using Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the DataFrame-based MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.


Prerequisites

Some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark are required. Familiarity with the concept of a DataFrame is helpful.

Brief conceptual reviews of data science techniques will be performed before the techniques are used. Labs and demos will be available in both Python and Scala.

Topics covered include

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics



Apache Spark Tuning and Best Practices

This 1-day course is for data engineers, analysts, architects, DevOps engineers, and team leads interested in troubleshooting and optimizing Apache Spark applications. It covers tuning techniques, best practices, and anti-patterns to avoid when tuning and troubleshooting Spark applications and queries.

Each topic includes lecture content along with hands-on use of Spark through an elegant web-based notebook environment. Inspired by tools like IPython/Jupyter, notebooks allow attendees to code jobs, data analysis queries, and visualizations using their own Spark cluster, accessed through a web browser. Students may keep the notebooks and continue to use them with the free Databricks Community Edition offering; all examples are guaranteed to run in that environment. Alternatively, each notebook can be exported as source code and run within any Spark environment.

Learning Objectives

After taking this class, students will:

  • understand the role of memory in Spark applications
  • understand how to use broadcast variables and, in particular, broadcast joins to increase the performance of DataFrame operations
  • understand the Catalyst query optimizer
  • understand how to tune Spark’s partitioning and shuffling behavior
  • understand how best to size a Spark cluster for different kinds of workflows


Prerequisites

  • Project experience with Apache Spark
  • Databricks SPARK 105 or equivalent
  • Basic programming experience in an object-oriented or functional language. The class will be taught in a mixture of Python and Scala.


Topics covered include

  • Spark Memory Usage
    • Using the Spark UI and Spark logs to determine how much memory your application is using
    • Understanding how Tungsten (used by DataFrames and Datasets) dramatically improves memory use, compared to the RDD API
    • Why it’s important that DataFrames never be partially cached, even if it means spilling the cache to disk
    • The benefits of co-located data
    • Tuning JVM garbage collection for Spark
  • Broadcast Variables
    • How broadcast variables can affect performance
    • Why broadcast joins are useful
    • How to force Spark to do a broadcast join
    • When not to force a broadcast join
  • Catalyst
    • Avoiding Catalyst anti-patterns, such as Cartesian products and partially cached DataFrames
    • Efficient use of the Datasets API within a query plan
    • Understanding how encoders and decoders affect Catalyst optimizations
    • When and how to write your own custom Catalyst optimizers
  • Tuning Shuffling
    • When does shuffling occur?
    • Understanding how shuffling affects repartitioning
    • Understanding shuffling impact on network I/O
    • Narrow vs. wide transformations
    • Spark configuration settings that affect shuffling
  • Cluster Sizing
    • How a lack of memory affects how you should size your disks
    • The importance of properly defined schemas on memory use
    • Hardware provisioning
      • How to decide how much memory to allocate to each machine
      • Network considerations
      • How to decide how many CPU cores each machine will need
    • FIFO scheduler vs. fair scheduler
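
As a rough illustration of the broadcast join topic listed under Broadcast Variables above, the following plain-Python sketch shows the underlying idea. It is not Spark code and is not taken from the course materials. The function name, the sample data, and the use of lists of dictionaries to stand in for partitions are all assumptions made for illustration. The small table is copied in full to wherever each partition of the large table lives, so every partition can be joined locally without moving any rows of the large table.

```python
from collections import defaultdict


def broadcast_hash_join(large_partitions, small_table, key):
    """Inner-join each partition of a large dataset against a small table.

    Hypothetical helper for illustration only; Spark's real implementation
    differs in many details.
    """
    # Build a hash map from the small table once. In Spark, this is the
    # copy that would be shipped ("broadcast") to every executor.
    lookup = defaultdict(list)
    for row in small_table:
        lookup[row[key]].append(row)

    joined_partitions = []
    for partition in large_partitions:
        # Each partition is joined locally against the lookup map, so no
        # rows from the large side need to be shuffled across the network.
        joined = [
            {**left, **right}
            for left in partition
            for right in lookup.get(left[key], [])
        ]
        joined_partitions.append(joined)
    return joined_partitions


# Made-up sample data: two partitions of a "large" table and a small lookup table.
orders = [
    [{"country": "IE", "amount": 10}, {"country": "FR", "amount": 7}],
    [{"country": "IE", "amount": 3}, {"country": "XX", "amount": 1}],
]
countries = [
    {"country": "IE", "name": "Ireland"},
    {"country": "FR", "name": "France"},
]

result = broadcast_hash_join(orders, countries, "country")
```

The sketch also hints at why forcing a broadcast join is not always a good idea: the lookup map must fit in memory on every worker, so a "small" table that is actually large makes this approach costly. In Spark itself, the `broadcast` function in `pyspark.sql.functions` marks a DataFrame as a broadcast candidate, and the `spark.sql.autoBroadcastJoinThreshold` setting controls when Spark chooses this strategy automatically.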