Training courses offered at Spark Summit 2017:

Spark Summit 2017 features a number of hands-on training workshops designed to help improve your Apache Spark skills. You can choose from 1-day or 2-day training; see below for the list of available courses.

Please note that training is offered as a standalone ticket. If you wish to attend any Spark Summit 2017 conference sessions or networking activities on June 6th and 7th, you must register for a Conference Only or All-Access pass in addition to your Training pass.

Location and Schedule

All training courses will be held at the Moscone West Convention Center in San Francisco. Please visit the Venue page for more details, including hotel and travel discounts.

1-day training will take place from 9AM to 5PM on Monday, June 5th. The ticket includes training and lunch on that day only.

2-day training will take place from 9AM to 5PM on Monday, June 5th and will continue from 11AM to 6PM on Tuesday, June 6th. The ticket includes training and lunch on those two days only.

Adding a Conference Pass to Your Training

To attend any of the Spark Summit 2017 conference sessions or networking activities on June 6 & 7, including the welcome reception in the Expo Hall or the JOIN attendee party, you’ll need to register for a Conference Only or All-Access pass in addition to your Training pass. If you’re enrolled in 2-day training and add a conference pass, you’ll be able to attend the morning keynotes on June 6th before your training starts at 11AM. However, please be aware that your training will overlap with the afternoon sessions that day (we’ll share videos of those conference sessions with you a week after the event ends).

What’s required for the tutorials?

Bring your own WiFi-enabled laptop with Google Chrome or Firefox installed.


Exploring Wikipedia with Apache Spark

1-day course

Instructor: Akmal Chaudhri

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class, we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of about 50% lecture and 50% hands-on labs and demos.

Prerequisites:

  • Some familiarity with Apache Spark is helpful.
  • Basic usage of Spark DataFrames is helpful.
  • Basic programming experience in an object-oriented or functional language (the class will mostly be taught in Scala) is required.

Topics covered include:

  • Overview: Wikipedia and Spark
  • Analyze data using:
    • DataFrames + Spark SQL
    • RDDs
    • Spark Streaming
    • MLlib (Machine Learning)
    • GraphFrames
  • Leveraging knowledge of Spark’s Architecture for performance tuning and debugging
  • How and when to use advanced Spark features:
    • Accumulators
    • Broadcast variables
    • Memory persistence levels
    • Spark UI details
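
The course labs aren’t published here, but as a rough taste of the DataFrames and Spark SQL topics above, here is a minimal Scala sketch that asks the same question two ways against a hypothetical Wikipedia pageviews file (the path and column names are made up for illustration):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

object WikipediaPageviews {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("WikipediaPageviews")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical pageviews file with columns: project, page, requests
    val pageviews = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("/data/wikipedia/pageviews.csv")

    // DataFrame API: top 10 most-requested English Wikipedia pages
    pageviews
      .filter($"project" === "en")
      .groupBy($"page")
      .agg(sum($"requests").as("total_requests"))
      .orderBy(desc("total_requests"))
      .limit(10)
      .show(truncate = false)

    // The same question expressed in Spark SQL
    pageviews.createOrReplaceTempView("pageviews")
    spark.sql(
      """SELECT page, SUM(requests) AS total_requests
        |FROM pageviews
        |WHERE project = 'en'
        |GROUP BY page
        |ORDER BY total_requests DESC
        |LIMIT 10""".stripMargin).show(truncate = false)

    spark.stop()
  }
}
```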


Just enough Scala for Spark

1-day course

Instructor: Chaoran Yu, Software Engineer, Fast Data, Lightbend Inc.

Apache Spark is written in Scala. Hence, many if not most data developers adopting Spark are also adopting Scala, while Python and R remain popular with data scientists. I think that Spark shows Scala at its best and largely hides the more difficult aspects of the language. This tutorial introduces you to the core features of Scala you need to be productive with Spark quickly. It’s designed for developers and data scientists interested in using Scala for Spark. Through hands-on exercises with the Spark APIs, you’ll learn the most important Scala syntax, idioms, and APIs for Spark development.

Prerequisites:

  • Familiarity with Spark is recommended, but not essential. Familiarity with Java is recommended.

Topics covered include:

  • Declaring and using classes, methods, and functions.
  • Immutable values vs. mutable variables.
  • Type inference: get your type safety without a lot of boilerplate.
  • Pattern matching: my favorite feature of Scala that’s so useful for Spark.
  • Scala collections and the common operations on them (the basis of the RDD API).
  • Other Scala types like case classes and tuples.
  • Effective use of the Spark shell, based on the Scala interpreter.
  • Common mistakes and how to avoid them, like serialization errors.
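
To give a feel for a few of these topics, here is a small, self-contained Scala sketch (not taken from the course materials) showing case classes, immutable values, type inference, collection operations, pattern matching, and tuples:

```scala
// Case classes give you immutable, pattern-matchable records "for free".
case class Edit(user: String, article: String, bytesChanged: Int)

object ScalaBasics {
  def main(args: Array[String]): Unit = {
    // Immutable values (val) rather than mutable variables (var); types are inferred.
    val edits = List(
      Edit("alice", "Apache_Spark", 120),
      Edit("bob", "Scala_(programming_language)", -40),
      Edit("alice", "Apache_Kafka", 300)
    )

    // Collection operations like filter/map/sum mirror the RDD API.
    val totalAdded = edits
      .filter(_.bytesChanged > 0)
      .map(_.bytesChanged)
      .sum
    println(s"Bytes added: $totalAdded")

    // Pattern matching destructures a case class and handles each case explicitly.
    edits.foreach {
      case Edit(user, article, delta) if delta < 0 =>
        println(s"$user removed ${-delta} bytes from $article")
      case Edit(user, article, delta) =>
        println(s"$user added $delta bytes to $article")
    }

    // Tuples are handy for quick key/value pairs, e.g. (user, edit count).
    val editsPerUser: Map[String, Int] =
      edits.groupBy(_.user).map { case (user, es) => (user, es.size) }
    println(editsPerUser)
  }
}
```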


**SOLD OUT** Architecting a Data Platform

1-day course

Instructors: John Akred, Stephen O’Sullivan, and Andrew Ray

What are the essential components of a data platform? This tutorial will explain how the various parts of the Hadoop, Spark and big data ecosystem fit together in production to create a data platform supporting batch, interactive, and real-time analytical workloads.

By tracing the flow of data from source to output, we’ll explore the options and considerations for components, including:

  • Acquisition: from internal and external data sources
  • Ingestion: offline and real-time processing
  • Storage
  • Analytics: batch and interactive
  • Providing data services: exposing data to applications

We’ll also give advice on:

  • tool selection
  • the function of the major Hadoop components and other big data technologies such as Spark and Kafka
  • integration with legacy systems
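
As one hedged illustration of how ingestion and storage can connect in such a platform, here is a minimal Structured Streaming sketch that reads events from a Kafka topic and lands them as Parquet for later batch analytics. The broker address, topic name, and paths are placeholders, and the spark-sql-kafka package is assumed to be on the classpath:

```scala
import org.apache.spark.sql.SparkSession

// Requires the spark-sql-kafka-0-10 package on the classpath.
object KafkaIngestion {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("KafkaIngestion")
      .getOrCreate()

    // Ingestion: read a stream of raw events from a Kafka topic.
    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .load()

    // Kafka delivers binary key/value columns; cast them to strings for processing.
    val events = raw.selectExpr(
      "CAST(key AS STRING) AS key",
      "CAST(value AS STRING) AS value",
      "timestamp")

    // Storage: continuously land the stream as Parquet for batch analytics.
    val query = events.writeStream
      .format("parquet")
      .option("path", "/data/landing/events")
      .option("checkpointLocation", "/data/checkpoints/events")
      .start()

    query.awaitTermination()
  }
}
```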

Prerequisites:

Laptop with:

  • 8GB RAM (Preferably 16GB)
  • 10GB free space

With the following tools installed:

  • Git
  • Docker
  • Java 8
  • Maven 3
  • Some kind of editor or Scala IDE
  • (Optional) Scala 2.11

Though not strictly required, participants should have a passing knowledge of the following technologies and concepts:

  • Docker
  • Java / Scala
  • Python
  • Spark
  • APIs

Additionally, to save time, participants should have already pulled the following Docker images:

  • spotify/kafka
  • cassandra:3.0.13


**SOLD OUT** Data Science with Apache Spark 2.x

1-day course

Instructor: Zoltan Toth

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the algorithms available in MLlib’s DataFrame-based API. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark, although we will view Spark’s source code a couple of times.

We’ll work through examples using public datasets that will show you how to apply Apache Spark to help you iterate faster and develop models on massive datasets. This workshop will give you the tools to be productive with Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the DataFrames MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.

Prerequisites:

Some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark are required. Familiarity with the concept of a DataFrame is helpful.

Brief conceptual reviews of data science techniques will be performed before the techniques are used. Labs and demos will be available in both Python and Scala.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics
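
As a rough preview of the pipeline topics, here is a minimal Scala sketch (not from the course materials) that chains a Tokenizer, HashingTF, IDF, and LogisticRegression into an MLlib Pipeline on a tiny made-up dataset:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
import org.apache.spark.sql.SparkSession

object TextPipeline {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("TextPipeline").getOrCreate()
    import spark.implicits._

    // Tiny made-up training set: (label, text)
    val training = Seq(
      (1.0, "spark makes big data simple"),
      (0.0, "the weather is nice today"),
      (1.0, "dataframes and mllib pipelines"),
      (0.0, "lunch is served at noon")
    ).toDF("label", "text")

    // Transformers turn raw text into feature vectors...
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
    // ...and an Estimator fits a model on those features.
    val lr = new LogisticRegression().setMaxIter(10)

    // A Pipeline chains the stages so fit/transform run them in order.
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, idf, lr))
    val model = pipeline.fit(training)

    model.transform(training)
      .select("text", "label", "prediction")
      .show(truncate = false)

    spark.stop()
  }
}
```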


**SOLD OUT** Apache Spark Intro for Machine Learning and Data Science

2-day course

Instructor: Brooke Wenig

This workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the algorithms available in MLlib’s DataFrame-based API. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark, although we will view Spark’s source code a couple of times.

We’ll work through examples using public datasets that will show you how to apply Apache Spark to help you iterate faster and develop models on massive datasets. This workshop will give you the tools to be productive with Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the DataFrames MLlib API, and related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.

We will cover most of the same topics as the 1-day Data Science with Apache Spark 2.x course, but with two days we will be able to dig into each topic in greater detail and spend more time on hands-on exercises.

Prerequisites:

Some experience coding in Python or Scala, a basic understanding of data science topics and terminology, and some experience using Spark are required. Familiarity with the concept of a DataFrame is helpful.

Brief conceptual reviews of data science techniques will be performed before the techniques are used. Labs and demos will be available in both Python and Scala.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • MLlib Pipelines: Transformers and Estimators
  • Cross validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Tokenizer, Bucketizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler, StringIndexer, PolynomialExpansion
  • Clustering, Classification, and Regression
  • K-means, Logistic Regression, Decision Trees, and Random Forests
  • Evaluation Metrics
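
To illustrate the cross-validation topic, here is a minimal sketch (again not from the course materials) that wraps a small text pipeline in a CrossValidator with a hyperparameter grid. The toy dataset is far too small for meaningful tuning and is only there to keep the example self-contained:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}
import org.apache.spark.sql.SparkSession

object CrossValidationExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("CrossValidationExample").getOrCreate()
    import spark.implicits._

    // Toy dataset: (label, text); a real workload would use far more data.
    val training = Seq(
      (1.0, "spark makes big data simple"),
      (0.0, "the weather is nice today"),
      (1.0, "dataframes and mllib pipelines"),
      (0.0, "lunch is served at noon"),
      (1.0, "tuning models with cross validation"),
      (0.0, "the hotel is near the venue")
    ).toDF("label", "text")

    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // The grid of hyperparameter combinations to search over.
    val paramGrid = new ParamGridBuilder()
      .addGrid(hashingTF.numFeatures, Array(100, 1000))
      .addGrid(lr.regParam, Array(0.01, 0.1))
      .build()

    // k-fold cross validation picks the best combination by held-out AUC.
    val cv = new CrossValidator()
      .setEstimator(pipeline)
      .setEvaluator(new BinaryClassificationEvaluator())
      .setEstimatorParamMaps(paramGrid)
      .setNumFolds(3)

    val cvModel = cv.fit(training)
    cvModel.transform(training).select("text", "prediction").show(truncate = false)

    spark.stop()
  }
}
```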


**SOLD OUT** Apache Spark Intro for Data Engineering

2-day course

Instructor: Jacob Parr

This 2-day course covers the core APIs for using Spark, fundamental mechanisms and basic internals of the framework, SQL and other high-level data access tools, as well as Spark’s streaming capabilities and machine learning APIs. The class is a mixture of lecture and hands-on labs.

We will cover most of the same topics as the 1-day Exploring Wikipedia with Apache Spark course, but with two days we will be able to dig into each topic in greater detail and spend more time on hands-on exercises.

Prerequisites:

  • Some familiarity with Apache Spark is helpful.
  • Basic usage of Spark DataFrames is helpful.
  • Basic programming experience in an object-oriented or functional language is required. The class will be taught in a mixture of Python and Scala.

Topics covered include:

  • Spark Overview
  • In-depth discussion of Spark SQL and DataFrames, including:
    • The DataFrames programming API
    • Spark SQL
    • The Catalyst query optimizer
    • The Tungsten in-memory data format
    • The Dataset API, encoders, and decoders
    • Use of the Spark UI to help understand DataFrame behavior and performance
    • Caching and storage levels
  • Overview of Spark internals
  • A brief review of RDDs
  • Graph processing with GraphFrames
  • An overview of Spark’s MLlib Pipeline API for Machine Learning
  • Spark Structured Streaming
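
As a small taste of the Dataset API and caching topics above, here is a minimal Scala sketch (not from the course materials) using a case-class encoder, typed transformations, and an explicit storage level:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.storage.StorageLevel

// A case class gives the Dataset API a typed schema via an implicit encoder.
case class PageView(project: String, page: String, requests: Long)

object DatasetsAndCaching {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("DatasetsAndCaching").getOrCreate()
    import spark.implicits._

    // A small in-memory Dataset; in class this would come from a real data source.
    val views = Seq(
      PageView("en", "Apache_Spark", 10000L),
      PageView("en", "Scala_(programming_language)", 4500L),
      PageView("de", "Apache_Spark", 1200L)
    ).toDS()

    // Typed transformations: the compiler checks field names and types.
    val english = views.filter(_.project == "en")

    // Persist with an explicit storage level; reusing the cached data avoids
    // recomputation and shows up on the Storage tab of the Spark UI.
    english.persist(StorageLevel.MEMORY_AND_DISK)

    println(s"English pages: ${english.count()}")
    english.orderBy($"requests".desc).show(truncate = false)

    english.unpersist()
    spark.stop()
  }
}
```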