San Francisco
June 30 - July 2, 2014

Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014, at the Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.


Spark Summit 2014
End-to-end analytics with Oryx, Spark and MLLib
Sandy Ryza (Cloudera)

Having collected large amounts of data, organizations are keen on data science and big learning. However, transitioning from “the lab” to a production-ready, large-scale operational analytics system remains a difficult and ad hoc endeavor, especially when real-time answers are required. In addition to scalable and performant model-building capabilities, a production system needs to be able to update and serve models reliably and in real time. The Oryx open source project provides simple, large-scale machine learning infrastructure for both building and serving predictive models. While Oryx has previously relied on custom MapReduce jobs for its model-building component, Apache Spark and its machine learning library, MLlib, provide a promising alternative. The talk will discuss our recent work in transitioning Oryx’s model-building component to run on Spark and leverage algorithms from MLlib, the reasons why Spark is well suited for the task, and the general anatomy of production large-scale machine learning infrastructure.

Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and a contributor to Apache Spark.

Slides PDF | Video