Spark Summit 2015e
Data Science With Apache Spark
Data science applications built on Apache Spark combine the scalability of Spark with distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and leveraging cloud-based notebooks within a team context. Includes limited free accounts on Databricks Cloud.
Topics covered include:
- Data transformation techniques based on both Spark SQL and functional programming in Scala and Python.
- Predictive analytics based on MLlib: clustering with KMeans, building classifiers with a variety of algorithms, and text analytics, all with emphasis on an iterative cycle of feature engineering, modeling, and evaluation.
- Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
- An explanation, from the designers of MLlib, of how primitives such as matrix factorization are implemented in a distributed parallel framework.
- Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and RecSys Challenge 2015.
Prerequisites:
- Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
- Experience coding in Scala, Python, SQL
- Some familiarity with data science topics (e.g., business use cases)
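As a rough illustration of the KMeans clustering topic and the modeling and evaluation cycle mentioned above, the following is a minimal single-machine sketch in plain Python. It is not MLlib's API or implementation, and it is not taken from the workshop materials. The function names (`assign`, `update`, `kmeans`, `wssse`), the toy `POINTS` data, and the naive choice of the first k points as starting centroids are all assumptions made for this example.

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)


def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points]


def update(points, labels, centroids):
    """Recompute each centroid as the mean of the points assigned to it."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(
            tuple(sum(axis) / len(members) for axis in zip(*members)))
    return new_centroids


def kmeans(points, k, max_iter=20):
    """Lloyd's algorithm with a naive deterministic start (the first k points)."""
    centroids = [tuple(p) for p in points[:k]]
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break  # assignments are stable, so the model has converged
        labels = new_labels
    return centroids, labels


def wssse(points, centroids, labels):
    """Within-set sum of squared errors: lower means tighter clusters."""
    return sum(dist(p, centroids[label]) ** 2
               for p, label in zip(points, labels))


# Hypothetical toy data: two well-separated groups of 2-D points.
POINTS = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0),
          (8.0, 8.0), (8.0, 9.0), (9.0, 8.0)]

if __name__ == "__main__":
    # Evaluation step of the cycle: compare cluster tightness across k.
    for k in (1, 2, 3):
        centroids, labels = kmeans(POINTS, k)
        print(k, round(wssse(POINTS, centroids, labels), 3))
```

Comparing the error across several values of k is one simple way to close the loop between modeling and evaluation. In a distributed setting the assignment and update steps would run over partitions of the data rather than over a Python list.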