San Francisco
June 30 - July 2, 2014


Spark Summit 2014
Towards Modularizing Spark Machine Learning Jobs
Lance Co Ting Keh (Box)

Spark powers machine learning at Box. In this presentation, we will discuss constructs we have built around Spark that have helped us increase our technical velocity by enforcing type safety, modularity, best Spark programming practices, and functional correctness on RDD transformations and actions.

I. The typical ML job
Inherent in writing machine learning algorithms at scale are difficulties such as handling malformed data and a large number of predictors, keeping track of the DAG of transformations, testability, and code reusability. We will begin by showing examples of typical ML algorithms and noting that they generally adhere to the following pattern: Reading Input from Sources -> Parsing and Deserializing -> Accumulating -> Dedup -> Normalization -> Serialization -> Output. We will then go through each of these actions and examine the intricacies that must be handled at each step (e.g. Reading Input involves handling fatal and nonfatal data errors and figuring out where the desired data lives; Output involves time stamping and checkpointing jobs and writing to the right datastores (MySQL, HDFS, etc.); and so on). Having established the typical pattern of ML algorithms, we will then turn to the main challenge: how to modularize and reuse the logic for each of these stages across all our jobs while enforcing correctness and type safety.
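The stage pattern above can be sketched as a pipeline of plain functions. This is a minimal illustration, not Box's actual code: the type aliases (`Raw`, `Record`) and stage bodies are assumptions made purely so the sketch is self-contained and runnable.

```scala
// Illustrative stage signatures for the pipeline described above.
// Type names and stage bodies are assumptions for the sketch.
type Raw    = String            // a raw input line
type Record = Map[String, Int]  // a parsed, typed record

def readInput(paths: List[String]): List[Raw] = paths.map(p => s"data-from-$p")
def parse(raw: List[Raw]): List[Record]       = raw.map(r => Map(r -> r.length))
def accumulate(rs: List[Record]): List[Record] = rs          // placeholder
def dedup(rs: List[Record]): List[Record]      = rs.distinct // drop duplicates
def normalize(rs: List[Record]): List[Record]  = rs          // placeholder
def serialize(rs: List[Record]): List[String]  = rs.map(_.toString)

// The full job is the composition of the stages:
def job(paths: List[String]): List[String] =
  serialize(normalize(dedup(accumulate(parse(readInput(paths))))))
```

Writing each stage against an explicit input and output type is what makes the later modularization possible: a stage can be swapped or reused without touching its neighbors.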

II. Modularizing with Sparkles
We will introduce the notion of a Sparkle, which represents a single unit of logical work in our universe. Sparkles adopt the Reader Monad pattern ~ trait Sparkle[A] { def run(sc: SparkContext): A }. The monadic nature of Sparkles allows us to chain actions together via for-comprehensions and to sequence jobs (List[Sparkle[A]] => Sparkle[List[A]]). We will discuss the benefits of ensuring each Sparkle has well-defined inputs and outputs, and then cite examples of the actions described above written as Sparkles.
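A minimal sketch of the Reader Monad pattern described here, using a stand-in `Ctx` type in place of a real `SparkContext` so the example is self-contained; `pure`, the method names, and the example job are illustrative, not the talk's actual API.

```scala
// Stand-in for SparkContext so the sketch runs without Spark.
final case class Ctx(appName: String)

// Reader-monad Sparkle: a computation that needs a context to run.
final case class Sparkle[A](run: Ctx => A) {
  def map[B](f: A => B): Sparkle[B] =
    Sparkle(ctx => f(run(ctx)))
  def flatMap[B](f: A => Sparkle[B]): Sparkle[B] =
    Sparkle(ctx => f(run(ctx)).run(ctx))
}

object Sparkle {
  def pure[A](a: A): Sparkle[A] = Sparkle(_ => a)

  // List[Sparkle[A]] => Sparkle[List[A]], as described above.
  def sequence[A](xs: List[Sparkle[A]]): Sparkle[List[A]] =
    xs.foldRight(pure(List.empty[A])) { (sa, acc) =>
      for { a <- sa; rest <- acc } yield a :: rest
    }
}

// Chaining via a for-comprehension; nothing runs until run(ctx) is called.
val job: Sparkle[Int] = for {
  n <- Sparkle(ctx => ctx.appName.length)
  m <- Sparkle.pure(10)
} yield n + m

job.run(Ctx("etl")) // 13
```

Because `run` is only invoked at the end, a launcher can assemble an entire job from small, typed pieces and hand the context in exactly once.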

III. Launchers Chain Together Sparkles
With Sparkles in place, each Spark job can be defined by chaining together Sparkles to perform specific work. Any of the algorithms described above can be written modularly as a chain of Sparkles. A simplified example of a typical ETL job is shown here:

for {
  logPreRdd      <- ...
  datastoreRdd   <- ...
  accumRdd       <- ...
  dupSnapshotRdd <- ...
  snapShotRdd    <- ...
  _              <- ...
} yield ()
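A self-contained sketch of such a chain follows. The stage Sparkles, their names, and the data they produce are hypothetical, and a stand-in `Ctx` replaces a real `SparkContext` so the example runs on its own.

```scala
// Stand-in context and a minimal Reader-monad Sparkle.
final case class Ctx(appName: String)
final case class Sparkle[A](run: Ctx => A) {
  def map[B](f: A => B): Sparkle[B]              = Sparkle(c => f(run(c)))
  def flatMap[B](f: A => Sparkle[B]): Sparkle[B] = Sparkle(c => f(run(c)).run(c))
}

// Hypothetical stage Sparkles mirroring the ETL chain above.
val readLogs: Sparkle[List[String]] =
  Sparkle(_ => List("a", "b", "b"))
def dedup(xs: List[String]): Sparkle[List[String]] =
  Sparkle(_ => xs.distinct)
def write(xs: List[String]): Sparkle[Unit] =
  Sparkle(c => println(s"[${c.appName}] wrote ${xs.size} rows"))

// The launcher: chain the stages, then run the whole job once.
val etlJob: Sparkle[Unit] = for {
  raw     <- readLogs
  deduped <- dedup(raw)
  _       <- write(deduped)
} yield ()

etlJob.run(Ctx("nightly-etl")) // prints "[nightly-etl] wrote 2 rows"
```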

We will carry out a side-by-side comparison of the type-safe and modular Sparkle approach against the conventional way Spark jobs are written.
Additionally, we will discuss Sparkle Metadata, a notion we created to address the fact that Sparkles are job agnostic. Simply put, Sparkle Metadata is a bag of key-values that is passed down the chain in a Sparkle for-comprehension. This metadata includes job specific tags that are used for logging, alerts and the like.
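The metadata bag can be sketched as an ordinary key-value map carried with the job; the keys, values, and helper below are illustrative assumptions, not the actual tags used at Box.

```scala
// Sketch of Sparkle Metadata as a bag of key-values passed down the chain.
// All keys, values, and helper names here are illustrative assumptions.
type SparkleMetadata = Map[String, String]

val meta: SparkleMetadata = Map(
  "jobName"   -> "snapshot-etl", // tags every log line for this job
  "alertTeam" -> "ml-infra"      // who gets alerted on failure
)

// A job-agnostic Sparkle can still emit job-specific logs via the bag.
def logLine(meta: SparkleMetadata, msg: String): String =
  s"[${meta.getOrElse("jobName", "unknown")}] $msg"

logLine(meta, "dedup stage complete") // "[snapshot-etl] dedup stage complete"
```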
The high-level benefits of this framework include the compile-time provability of the functional correctness of a job; DRY-ness, both in terms of shared Sparkle logic (an inheritance hierarchy of Sparkles) and reusing Sparkles across jobs; testability and ease of debugging; and the enforcement of best practices in writing Spark jobs.

Lance Co Ting Keh is a Sr. Software Engineer and founding member of machine learning at Box, which is building a platform that makes it ridiculously easy for enterprises to share, manage and create content. Lance is an academic at heart who is currently interested in distributed systems that can support large-scale machine learning. His past papers and patents span the fields of nonlinear dynamics, fault-tolerant computing and energy-absorbing materials. He has a BS and MS from Duke University.
