Databricks Spark Training

Databricks Spark training is offered as part of the 3-day pass to Spark Summit Europe and consists of an introductory course offered in two language options, a data science track, and an advanced track. All sessions begin at 9am on October 27, 2015 and end by 5pm. Lunch will be provided. Workshops will run on Databricks.

Labs for all three classes will be completed on Databricks, and students will receive a free Databricks account for the training.

Trainings offered at Spark Summit Europe:

What’s required for the tutorials?

  • Bring your own Wi-Fi-enabled laptop with Chrome or Firefox installed.

APACHE SPARK ESSENTIALS

Python Instructor: Adam Breindel

Scala Instructor: Brian Clapper

Overview:

Apache Spark Essentials will help you get productive with the core capabilities of Spark, as well as provide an overview and examples for some of Spark’s more advanced features. This full-day course features hands-on technical exercises so that you can become comfortable applying Spark to your datasets. In this class, you will get hands-on experience with ETL, exploration, and analysis using real-world data.

Prerequisites:

Some experience in Python or Scala. This course will be taught in two sections (one in Scala, the other in Python); please enroll in the section corresponding to your preferred language. Some familiarity with big data or parallel processing concepts is helpful.

Topics covered include:

  • Overview: Big Data and Spark
  • Parallel Processing with Resilient Distributed Datasets (RDDs)
  • Transformations and Actions on Data using RDDs
  • Structured Data Processing with DataFrames
  • DataFrames and Spark SQL for Easier, Faster, Scalable Computing
  • Spark Architecture and Cluster Deployment
  • Memory and Performance
  • Overview: Machine Learning and Streaming Data Sources
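
To give a flavor of the first few topics above, here is a minimal PySpark sketch of RDD transformations and actions and of DataFrames with Spark SQL. It assumes the Spark 1.5-era API; on Databricks the SparkContext (sc) and SQLContext (sqlContext) are already created for you, so the two construction lines are only needed outside a notebook. The application name and the tiny in-memory dataset are purely illustrative.

    # Minimal sketch of RDDs and DataFrames (Spark 1.5-era PySpark API).
    from pyspark import SparkContext
    from pyspark.sql import SQLContext, Row

    sc = SparkContext(appName="essentials-sketch")  # pre-created on Databricks
    sqlContext = SQLContext(sc)                     # pre-created on Databricks

    # RDDs: transformations (flatMap, map, reduceByKey) build a lazy plan;
    # actions (take, count) trigger the parallel computation.
    lines = sc.parallelize(["spark makes big data simple",
                            "spark runs on a cluster"])
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))

    # DataFrames: structured data processing, with the same data also
    # queryable through Spark SQL.
    df = sqlContext.createDataFrame([Row(name="Alice", age=34),
                                     Row(name="Bob", age=29)])
    df.filter(df.age > 30).show()
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 30").show()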

DATA SCIENCE WITH APACHE SPARK

Instructor: Jon Bates

Overview:

The Data Science with Apache Spark workshop will show how to use Apache Spark to perform exploratory data analysis (EDA), develop machine learning pipelines, and use the APIs and algorithms available in Spark ML and Spark MLlib. It is designed for software developers, data analysts, data engineers, and data scientists.

It will also cover parallelizing machine learning algorithms at a conceptual level. The workshop will take a pragmatic approach, with a focus on using Apache Spark for data analysis and building models using ML and MLlib, while limiting the time spent on machine learning theory and the internal workings of Spark.

We’ll work through examples using real public datasets that will show you how to apply Apache Spark to iterate faster and develop models on massive datasets. This workshop will give you the tools to be productive using Spark on practical data analysis tasks and machine learning problems. You’ll learn how to use familiar Python libraries with Spark’s distributed and scalable engine. After completing this workshop you should be comfortable using DataFrames, the ML and MLlib APIs, and the related documentation. These building blocks will enable you to use Apache Spark to solve a variety of data analysis and machine learning tasks.

Prerequisites:

Some experience coding in Python, a basic understanding of data science topics and terminology, and some experience using Spark. Brief conceptual reviews of data science techniques will be given before they are used. Labs and demos will be presented using PySpark, and labs will also be made available in Scala.

Topics covered include:

  • Extract, Transform, Load (ETL) and Exploratory Data Analysis (EDA)
  • DataFrames
  • Feature Extraction and Transformation using MLlib
  • ML Pipelines: Transformers and Estimators
  • Cross-validation
  • Model Parallel vs Data Parallel
  • Reusing existing code with Spark (examples in Python)
  • Word2Vec, Tokenizer, OneHotEncoder, Normalizer, HashingTF, IDF, StandardScaler, VectorAssembler
  • Clustering, Classification, and Regression
  • K-means, LDA, Linear Regression, Logistic Regression, Random Forests, Gradient Boosted Trees
  • Evaluation Metrics
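
As a taste of the pipeline topics listed above, here is a minimal spark.ml sketch in PySpark (Spark 1.5-era API): Tokenizer and HashingTF as Transformers, LogisticRegression as an Estimator, and cross-validation over a small parameter grid. The toy labeled data, application name, and parameter values are illustrative only, not course material.

    # Minimal ML Pipeline sketch: Transformers + an Estimator, cross-validated.
    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    sc = SparkContext(appName="data-science-sketch")  # pre-created on Databricks
    sqlContext = SQLContext(sc)

    train = sqlContext.createDataFrame(
        [(0.0, "spark is fast"), (0.0, "spark streaming jobs"),
         (0.0, "spark sql joins tables"), (0.0, "spark mllib trains models"),
         (1.0, "hadoop map reduce"), (1.0, "classic batch map reduce"),
         (1.0, "hive on map reduce"), (1.0, "pig scripts for map reduce")],
        ["label", "text"])

    # Transformers (Tokenizer, HashingTF) feed an Estimator (LogisticRegression).
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashingTF = HashingTF(inputCol="words", outputCol="features", numFeatures=1000)
    lr = LogisticRegression(maxIter=10)
    pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])

    # Cross-validation over a small grid of regularization strengths.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(), numFolds=2)
    model = cv.fit(train)
    model.transform(train).select("text", "prediction").show()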

ADVANCED: EXPLORING WIKIPEDIA WITH SPARK (TACKLING A UNIFIED CASE)

Instructor: Sameer Farooqui

Overview:

The real power and value proposition of Apache Spark is in building a unified use case that combines ETL, batch analytics, real-time stream analysis, machine learning, graph processing, and visualizations. In class we will explore various Wikipedia datasets while applying the ideal programming paradigm for each analysis. The class will consist of about 50% lecture and 50% hands-on labs and demos.

Prerequisites:

  • Have at least one month of hands-on experience working with Apache Spark
  • Understand the difference between the Driver and Executor JVMs
  • Know how the following transformations work: map(), flatMap(), filter(), sample(), distinct(), union(), join(), groupByKey(), reduceByKey(), repartition()
  • Know how the following actions work: collect(), count(), saveAsTextFile(), first(), take(), reduce()
  • Basic usage of DataFrames
  • Basic programming experience in an object-oriented or functional language (the class will mostly be taught in Scala)
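
To check yourself against the prerequisites above, here is a small sketch that exercises several of the listed transformations and actions. It is shown in PySpark for consistency with the earlier examples; the same RDD API exists in Scala, which the class mostly uses. The data and application name are hypothetical.

    # Transformations build a lineage lazily; actions run it on the executors
    # and return results to the driver.
    from pyspark import SparkContext

    sc = SparkContext(appName="advanced-prereq-sketch")

    words = (sc.parallelize(["a b", "b c", "a c"])
               .flatMap(lambda s: s.split())           # transformation
               .filter(lambda w: w != "c"))            # transformation
    counts = (words.map(lambda w: (w, 1))              # transformation
                   .reduceByKey(lambda x, y: x + y))   # transformation (shuffles)
    labels = sc.parallelize([("a", "first"), ("b", "second")])
    joined = counts.join(labels)                       # transformation (shuffles)

    print(counts.collect())          # action: results come back to the driver
    print(joined.take(2))            # action
    print(words.distinct().count())  # action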

Topics covered include:

  • Overview: Wikipedia and Spark
  • Analyze data using:
    • DataFrames + Spark SQL
    • RDDs
    • Spark Streaming
    • MLlib (Machine Learning)
    • GraphX
  • Leveraging knowledge of Spark’s Architecture for performance tuning and debugging
  • How and when to use advanced Spark features:
    • Accumulators
    • Broadcast variables
    • Memory persistence levels
    • Spark UI details
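
For orientation, here is a minimal PySpark sketch of three of the advanced features above: an accumulator, a broadcast variable, and an explicit memory persistence level. The record format, lookup table, and application name are hypothetical, not taken from the course’s Wikipedia datasets.

    # Accumulators (write-only counters updated on executors), broadcast
    # variables (read-only data shipped once per executor), and persistence.
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="advanced-features-sketch")

    bad_records = sc.accumulator(0)
    country_names = sc.broadcast({"DE": "Germany", "FR": "France"})

    def parse(line):
        fields = line.split(",")
        if len(fields) != 2:
            bad_records.add(1)           # count malformed rows on the executors
            return None
        code, views = fields
        return (country_names.value.get(code, "unknown"), int(views))

    rows = sc.parallelize(["DE,100", "FR,42", "oops", "DE,7"])
    parsed = rows.map(parse).filter(lambda r: r is not None)
    parsed.persist(StorageLevel.MEMORY_ONLY)   # keep in memory for reuse
    print(parsed.reduceByKey(lambda a, b: a + b).collect())
    print("bad records seen: %d" % bad_records.value)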