A Data Frame Abstraction Layer for SparkR

The data frame is a fundamental construct in R programming and one of the primary reasons R has become such a popular language for data analysis. In Spark 1.3, Spark SQL received its own implementation of the data frame concept, which extends the SchemaRDD with additional methods that enable more intuitive data manipulation. The similarities between R’s data frame and its analogue in Spark SQL represent a clear opportunity for development within the SparkR project. A natural, and proven, path forward is to create R bindings to the existing Spark DataFrame API and extend it to work with R’s traditional data frame syntax. With DataFrame support, we expose new functionality that allows R programmers to quickly become proficient at working with large, distributed datasets in Spark. In this talk, we describe how this was accomplished and provide a demonstration of the enhanced functionality in SparkR.

About Chris

Chris Freeman is a Content Engineer at Alteryx, where he develops new tools and macros for the Alteryx Desktop product. He is also a contributor to the SparkR project and one of the primary developers working on Alteryx’s integration of Apache Spark via the R language. Chris holds a Master’s degree in Economics from the University of North Texas and worked as a data analyst in the marketing industry prior to joining Alteryx.

About Shivaram

Shivaram Venkataraman is a fourth-year PhD student at the University of California, Berkeley, where he works with Mike Franklin and Ion Stoica at the AMP Lab. He is a committer on the Apache Spark project, and his research interests are in designing systems for large-scale machine learning. Before coming to Berkeley, he completed his M.S. at the University of Illinois, Urbana-Champaign and worked as a Software Engineer at Google.