Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014, at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Having collected large amounts of data, organizations are keen on data science and big learning. However, transitioning from “the lab” to a production-ready, large-scale operational analytics system remains a difficult and ad hoc endeavor, especially when real-time answers are required. In addition to scalable and performant model-building capabilities, a production system needs to be able to update and serve models reliably and in real time. The Oryx open source project provides simple, large-scale machine learning infrastructure for both building and serving predictive models. While Oryx has previously relied on custom MapReduce jobs for its model-building component, Apache Spark and its machine learning library, MLlib, provide a promising alternative. The talk will discuss our recent work in transitioning Oryx’s model-building component to run on Spark and leverage algorithms from MLlib, the reasons why Spark is well suited for the task, and the general anatomy of production large-scale machine learning infrastructure.
Sandy Ryza is an engineer on the data science team at Cloudera. He is a committer on Apache Hadoop and a contributor to Apache Spark.