R is the latest language added to Apache Spark, and the SparkR API is slightly different from PySpark. With the release of Spark 2.0, the R API officially supports executing user code on distributed data. This is done through a family of apply() functions. In this talk I will give an overview of this new functionality in SparkR. Using this API requires some changes to regular code with dapply(). This talk will focus on how to correctly use this API to parallelize existing R packages. Most important topics of consideration will be performance and correctness when using the apply family of functions in SparkR.
Hossein Falaki is a software engineer and data scientist at Databricks, working on the next big thing. Prior to that he was a data scientist at Apple’s personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).