The SparkR project provides language bindings and runtime support that let users run scalable computation from R using Apache Spark. SparkR has an active set of contributors from many companies, and a number of recent developments have improved performance and usability. Some of the improvements include:
(a) a new R to JVM bridge that enables easy deployment to YARN clusters,
(b) serialization-deserialization routines that enable integration with other Spark components like ML Pipelines,
(c) a complete RDD API, with support for DataFrames coming, and
(d) performance improvements for various operations including shuffles.
This talk will present an overview of the project, outline some of the technical contributions, and discuss new features we will build over the next year. We will also present a demo showcasing how SparkR can be used to seamlessly process large datasets on a cluster directly from the R console.
Shivaram Venkataraman is a fourth-year PhD student at the University of California, Berkeley, and works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project, and his research interests are in designing systems for large-scale machine learning. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign and worked as a Software Engineer at Google.
Rui (Ray) Sun is a software engineer on Intel's big data team, focusing on contributions to the Spark ecosystem. He is a SparkR/Shark/HIVE contributor. Before that, he worked on firmware development for Intel platforms for 8 years. His interests include distributed computing and systems programming.