San Francisco
June 30 - July 2, 2014


Spark Summit 2015
Data Science With Apache Spark

Data science applications built on Apache Spark combine Spark's scalability with distributed machine learning algorithms. This material expands on the “Intro to Apache Spark” workshop. Lessons focus on industry use cases for machine learning at scale, coding examples based on public data sets, and the use of cloud-based notebooks within a team context. The workshop includes limited free accounts on Databricks Cloud.
Topics covered include:

  • Data transformation techniques based on both Spark SQL and functional programming in Scala and Python.
  • Predictive analytics based on MLlib, clustering with KMeans, building classifiers with a variety of algorithms, and text analytics – all with an emphasis on an iterative cycle of feature engineering, modeling, and evaluation.
  • Visualization techniques (matplotlib, ggplot2, D3, etc.) to surface insights.
  • An understanding, from the designers of MLlib, of how primitives such as matrix factorization are implemented in a distributed parallel framework.
  • Several hands-on exercises using datasets such as MovieLens, Titanic, State of the Union speeches, and RecSys Challenge 2015.

Prerequisites:

  • Intro to Apache Spark workshop or equivalent (e.g., Spark Developer Certificate)
  • Experience coding in Scala, Python, SQL
  • Some familiarity with data science topics (e.g., business use cases)