Spark Summit 2013 brought the Apache Spark community together on December 2-3, 2013 at the Hotel Nikko in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Want to prepare a dataset with MapReduce and Pig, query it with Impala, and fit a model to it with Spark? To run these alongside each other and share resources across them in real time? CDH recently added the capability to dynamically schedule Impala work alongside MapReduce, centrally managed by YARN. Moving beyond static allocations lets users think of resources in terms of workloads that may span processing paradigms, and lets cluster operators allocate portions of their cluster to units within their organization instead of to processing frameworks. As Spark joins MapReduce and Impala as a data processing framework, we would like to extend this resource management vision to it. The existing Spark on YARN work is a strong step in this direction, but allowing Spark applications to fluidly grab and release resources through YARN will require additional work in both Spark and YARN: for example, resizable containers in YARN and off-heap memory in Spark that can be returned to the OS. This talk will discuss the current state of resource management on Hadoop, how Spark currently fits in, the work needed to share resources fluidly between Spark and other processing frameworks on YARN, and the kinds of pipelines and mixed workloads that this resource sharing will enable.
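As a rough illustration of allocating cluster capacity to organizational units rather than to processing frameworks, a YARN Fair Scheduler allocation file (fair-scheduler.xml) can define per-team queues with relative weights. The queue names and values below are hypothetical, not from the talk:

```xml
<?xml version="1.0"?>
<!-- Hypothetical fair-scheduler.xml: capacity is divided between
     organizational units, not between MapReduce, Impala, or Spark.
     Any framework's jobs submitted to a queue share its allocation. -->
<allocations>
  <queue name="analytics">
    <!-- Gets twice the share of an equally loaded weight-1 queue -->
    <weight>2.0</weight>
    <minResources>8192 mb,8 vcores</minResources>
  </queue>
  <queue name="research">
    <weight>1.0</weight>
    <maxRunningApps>10</maxRunningApps>
  </queue>
</allocations>
```

With queues drawn along organizational lines, a team's MapReduce, Impala, and (eventually) Spark workloads would contend for the same pool, and idle capacity in one queue can be borrowed by the other until its owner needs it back.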