Databricks Spark training is offered as part of the 2-day pass to the Spark Summit and comprises three tracks: introductory, advanced DevOps, and data science. All sessions begin at 9am on March 19, 2015 and end by 5pm. Lunch will be provided. Workshops will run on Databricks Cloud.
What’s required for the tutorials?
Bring your own laptop
Wi-Fi-enabled laptop with browsing capability
Reasonably current hardware (2 GB or more)
Mac OS X, Windows, or Linux, installed and functioning
Temporarily remove any corporate security controls that prevent use of the conference network
Have Java JDK 6/7/8 installed
Have Python 2.7 installed
Note: Do not install Spark with Homebrew or Cygwin
The Introduction to Apache Spark workshop is for developers to learn the core Spark APIs in-depth. This full-day course features hands-on technical exercises to get up to speed using Spark for data exploration, analysis, and building Big Data applications. Topics covered include:
Overview of Big Data and Spark
Installing Spark Locally
Using Spark’s Core APIs in Scala, Java, Python
Building Spark Applications
Deploying on a Big Data Cluster
Combining SQL, Machine Learning, and Streaming for Unified Pipelines
Prerequisites: some experience coding in Python, Java, or Scala, plus some familiarity with Big Data issues and concepts.
Data Science applications with Apache Spark combine the scalability of Spark with distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud. Topics covered include:
Data transformation techniques based on both Spark SQL and functional programming in Scala and Python.
Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms and text analytics – all with emphasis on an iterative cycle of feature engineering, modeling, evaluation.
Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
Understanding, from the designers of MLlib, how primitives such as matrix factorization are implemented in a distributed parallel framework.
Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and RecSys Challenge 2015.
Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
Experience coding in Scala, Python, SQL
Have some familiarity with Data Science topics (e.g., business use cases)
Instructor: Sameer Farooqui, Client Services Engineer at Databricks
The slides for this class can be found here, licensed under a Creative Commons license.
(note that actual slides used in class are subject to change)
Have you opened the Spark shell, run some transformations and actions, and now want to take your Spark knowledge to the next level? This advanced course is for students who already have some beginner-level familiarity with Spark’s architecture and developer API (or have gone through the prerequisites listed below). Students attending can expect a fast-paced, vendor-agnostic, and very technical class on Spark.
This class will be about 80% lecture and 20% hands-on labs. All students will get a temporary account on Databricks Cloud to try out the 2 Spark Core DevOps labs (one before lunch, one after). Additional labs will be given for students to complete after class.
The focus of the class will be understanding the deep architecture of Spark Core to give students a solid foundation for learning the higher-level libraries after class. However, there will be some coverage of Spark SQL (~20 minutes) and Spark Streaming (~45 minutes) as well. We will not be discussing MLlib or GraphX in this course at all.
After this class, you will be able to:
Explain how Spark runs differently on Hadoop versus Cassandra
Understand how a Spark application breaks down into: Jobs -> Stages -> Tasks
Write more performant Spark code by chaining transformations in intelligent ways, using custom serialization, and tweaking Spark core settings
Explain the pros and cons of writing Spark code in Scala vs. Python vs. Java
Explain how a developer’s Spark application code executes from an operational perspective in a large cluster (DevOps)
Spark Architecture on various Resource Managers
Resource Managers covered: Local mode, Standalone Scheduler, YARN, Mesos
How these JVMs interact with each other: Driver, Executor, Worker, Spark Master
Using Spark Submit to submit a Spark application to any of the resource managers
Considerations for how much RAM to assign each JVM
How to think about setting the number of task slots in each Executor based on different workload requirements
How to deal with Out of Memory errors in Spark (and how to identify whether the OOM is occurring in the Driver or an Executor)
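As a rough illustration of the task-slot sizing item above, the sketch below works through the arithmetic in plain Python. It assumes each Executor offers one slot per assigned core, divided by the CPUs reserved per task (Spark's `spark.task.cpus` setting, 1 by default). The function names and the cluster numbers are hypothetical and chosen only for illustration.

```python
import math


def total_task_slots(num_executors, cores_per_executor, cpus_per_task=1):
    """Tasks that can run at the same time across all Executors."""
    return num_executors * (cores_per_executor // cpus_per_task)


def waves_needed(num_tasks, slots):
    """Rounds ("waves") of tasks needed to finish a stage with the given slots."""
    return math.ceil(num_tasks / slots)


# Hypothetical cluster: 10 Executors with 4 cores each.
slots = total_task_slots(num_executors=10, cores_per_executor=4)

# A stage with 200 partitions produces 200 tasks.
waves = waves_needed(num_tasks=200, slots=slots)

print(slots, waves)  # 40 slots, 5 waves
```

More slots per Executor raise parallelism but leave less memory per running task, which is one reason slot counts and RAM are usually sized together.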
Resilient Distributed Datasets
Understanding lineage: Narrow vs. Wide dependencies
Demystifying how a Spark application breaks down into Jobs -> Stages -> Tasks
How to choose between the 7 different persistence levels for caching RDDs
Becoming aware of the ~20 types of RDDs (HadoopRDD, MappedRDD, FilteredRDD, CassandraRDD, SchemaRDD, etc)
Using the “Preserves Partitioning” parameter to make shuffles in Spark run faster
When and why to use Broadcast Variables and Accumulators
Architecture of how the new BitTorrent implementation of Broadcast Variables works
Use case: How the 100-TB sort competition was won
Architecture of the 2 shuffle implementations in Spark 1.2: HASH and SORT
Details of the EC2 hardware and software settings used for winning the 100-TB sort benchmark in 2014 (sortbenchmark.org)
External Shuffle Service
Netty Native Transport (aka zero-copy)
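The Jobs -> Stages -> Tasks breakdown listed above can be modeled in a few lines of plain Python. This is a simplified sketch, not Spark's scheduler: it assumes a linear chain of transformations, treats a hand-picked set of operations as always wide (in practice, for example, a join of co-partitioned RDDs can be narrow), and starts a new stage at every wide dependency. The names `WIDE_OPS`, `split_into_stages`, and `count_tasks` are hypothetical.

```python
# Operations assumed (for this sketch only) to always require a shuffle.
WIDE_OPS = {"groupByKey", "reduceByKey", "sortByKey", "join", "distinct", "repartition"}


def split_into_stages(transformations):
    """Split a linear chain of transformation names into stages.

    Narrow operations are pipelined into the current stage; a wide
    operation marks a shuffle boundary and begins a new stage.
    """
    stages = [[]]
    for op in transformations:
        if op in WIDE_OPS and stages[-1]:
            stages.append([])
        stages[-1].append(op)
    return stages


def count_tasks(stages, partitions_per_stage):
    """One task per partition per stage."""
    if len(stages) != len(partitions_per_stage):
        raise ValueError("need one partition count per stage")
    return sum(partitions_per_stage)


chain = ["textFile", "flatMap", "map", "reduceByKey", "filter", "sortByKey", "map"]
stages = split_into_stages(chain)
for i, ops in enumerate(stages):
    print("stage", i, ops)

# Hypothetical partition counts: 8 input partitions, 4 after each shuffle.
print("tasks:", count_tasks(stages, [8, 4, 4]))
```

In this model the chain yields three stages, because `reduceByKey` and `sortByKey` each introduce a shuffle boundary, while the narrow `flatMap`, `map`, and `filter` steps stay in the stage they follow.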
Architecture of how Python user code gets shipped via the Driver and Executor JVMs to actual Python PIDs running in the distributed cluster
Explanation of how user code runs in Python PIDs, but MLlib/SQL/Shuffle runs directly in Executor JVMs
When and why to choose different implementations of Python (CPython vs. PyPy)
Performance considerations when developing in PySpark
Spark SQL
Note: This will be a relatively short section
Leveraging SchemaRDDs and DataFrames
Comparing the architecture of Apache Hive vs. Spark SQL
Using Parquet Files
Spark Streaming
Note: This will also be a relatively short section
Understanding the Streaming Architecture: How DStreams break down into RDD batches
How receivers run inside Executor task slots to capture data coming in from a network socket, Kafka or Flume
Increasing throughput performance in Spark Streaming via multiple receivers and the Union transformation
Sliding window operations on DStreams
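To make the batch and sliding-window ideas above concrete, here is a toy model in plain Python; it does not use the Spark Streaming API. It assumes a stream has already been cut into per-interval batches (the role RDDs play inside a DStream) and measures both the window length and the slide interval in numbers of batches. The function name `sliding_windows` and the sample data are hypothetical.

```python
def sliding_windows(batches, window_length, slide_interval):
    """Group per-interval batches into sliding windows.

    Every `slide_interval` batches, emit the records of the most recent
    `window_length` batches (fewer at the start of the stream).
    """
    windows = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        merged = [record for batch in batches[start:end] for record in batch]
        windows.append(merged)
    return windows


# Six batch intervals, one record each, to keep the output easy to follow.
batches = [[1], [2], [3], [4], [5], [6]]

# Window of 3 batches, sliding forward every 2 batches.
for window in sliding_windows(batches, window_length=3, slide_interval=2):
    print(window)
# [1, 2]
# [2, 3, 4]
# [4, 5, 6]
```

Consecutive windows overlap whenever the window length exceeds the slide interval, so the same record (here 2 and 4) can contribute to more than one result.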
Also, remember to bring your laptop to class. There is no need to have Spark pre-installed or Java/Python installed on your laptop. You just need Chrome or Firefox to access Databricks Cloud.
All students are strongly advised to complete the following prerequisites BEFORE class starts:
1 hour: Watch this introductory video on how Spark works:
30 mins: Skim through the official Spark 1.2 documentation. Click on the links in the top navigation bar under Programming Guides, API docs, Deploying and More to get familiar with the breadth of topics. You don’t need to read the entire documentation here, but you should be familiar with what the different sections are.
1 hour: Install Spark (either via an Apache download or through CDH/HDP/MapR/DSE) and, using the Python or Scala shell, run through some basic transformations (map, filter, distinct, etc.) and actions (collect, count, reduce, first, saveAsTextFile, etc.).
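For readers who want a feel for the difference between transformations and actions before opening a Spark shell, the following is a toy, single-machine stand-in written in plain Python. `ToyRDD` is a hypothetical class invented for this sketch and is not the Spark API; it only mimics the idea that transformations are recorded lazily and that nothing is computed until an action is called.

```python
from functools import reduce


class ToyRDD:
    """Toy single-machine stand-in for an RDD (not the Spark API)."""

    def __init__(self, compute):
        # Zero-argument function that yields the elements when called.
        self._compute = compute

    @classmethod
    def parallelize(cls, data):
        items = list(data)
        return cls(lambda: iter(items))

    # Transformations: build a new ToyRDD, do no work yet.
    def map(self, f):
        return ToyRDD(lambda: (f(x) for x in self._compute()))

    def filter(self, pred):
        return ToyRDD(lambda: (x for x in self._compute() if pred(x)))

    def distinct(self):
        def gen():
            seen = set()
            for x in self._compute():
                if x not in seen:
                    seen.add(x)
                    yield x
        return ToyRDD(gen)

    # Actions: run the recorded chain and return a result.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, f):
        return reduce(f, self._compute())

    def first(self):
        return next(self._compute())


nums = ToyRDD.parallelize([1, 2, 2, 3, 4, 4, 5])
evens_times_ten = nums.filter(lambda x: x % 2 == 0).distinct().map(lambda x: x * 10)

print(evens_times_ten.collect())                   # [20, 40]
print(evens_times_ten.count())                     # 2
print(evens_times_ten.reduce(lambda a, b: a + b))  # 60
print(evens_times_ten.first())                     # 20
```

Unlike this toy, a real RDD is partitioned across a cluster, and operations such as distinct do not guarantee element order.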
If you have media questions, or would like to find out about sponsoring a Spark Summit, please contact firstname.lastname@example.org.
Apache, Apache Spark, Spark, and the Spark logo are trademarks of the Apache Software Foundation. The Apache Software Foundation has no affiliation with and does not endorse the materials provided at this event.