Machine Learning (ML) workflows often involve a sequence of processing and learning stages. For example, classifying text documents might involve cleaning the text, transforming raw text into feature vectors, and training a classification model. Realistic workflows are often even more complex, including cross-validation to choose parameters and combining multiple data sources. With most current tools for ML, it is difficult to set up practical pipelines. Many ML tools either (a) lack support for distributed computation, or (b) have no utilities to help users assemble pipelines. Spark’s MLlib already supported distributed ML. This talk discusses Pipelines and related concepts introduced in Spark 1.2 which provide a simple API for users to set up complex ML workflows. We use a running example of text processing to demonstrate the simple but flexible API. Key Concepts: We introduce key concepts which help support complex ML workflows. (1) Datasets: Tabular data forms like DataFrames are widely used to build ML algorithms in single machine settings. We adopt SchemaRDD as a distributed ML dataset. SchemaRDDs allow us to handle diverse types while improving integration with the rest of Spark. (2) Pipelines: ML workflows are modeled as Pipelines, which consist of a sequence of stages. Each stage transforms input data to produce output for succeeding stages. Our API makes workflows easy to assemble, inspect and debug. (3) Pipeline Stages: We define two kinds of stages: Transformers and Estimators. A Transformer transforms a dataset; e.g., HashingTF transforms raw text into feature vectors, and a LogisticRegressionModel transforms feature vectors into predictions. An Estimator (e.g., LogisticRegression) fits on a dataset and produces a Transformer (e.g., LogisticRegressionModel). Model Selection: In addition to providing tools for structuring complex workflows, Pipelines and other tools introduced in Spark 1.2 help support model selection, i.e., using data to choose parameters for Transformers and Estimators. ParamGridBuilder and CrossValidator provide simple APIs for this critical ML task. Implementation: Pipelines heavily rely on SchemaRDD. Schemas allow the flexibility of weak types, plus the safety of runtime schema validation (before actually running Pipelines). Non-native SQL types (MLlib Vectors) are currently supported by an internal User-Defined Type API in Spark SQL. Future Development: During the next release cycle, MLlib will gain more features for users (such as more algorithms and a Python API for Pipelines) and for developers (such as supporting abstractions). This is joint work with Xiangrui Meng (Databricks), Evan Sparks (UC Berkeley), and Shivaram Venkataraman (UC Berkeley).
Joseph Bradley is a Software Engineer at Databricks, working on Spark MLlib. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon U. in 2013. His research included probabilistic graphical models, parallel sparse regression, and aggregation mechanisms for peer grading in MOOCs.