Machine Learning workflows involve complex sequences of data transformations, learning algorithms, and parameter tuning. Spark ML Pipelines, introduced in Spark 1.2, have grown into a powerful framework for developing ML workflows. This talk will cover basic Pipeline concepts and then demonstrate their usage:
(1) Building: Pipelines simplify the process of specifying a ML workflow.
(2) Debugging: Pipelines and DataFrames permit users to inspect and debug the workflow.
(3) Tuning: Built-in support for parameter tuning helps users optimize ML performance.
Joseph Bradley is a Spark Committer working on MLlib at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.