SparkR is a lightweight frontend for using Apache Spark from R, enabling users to run R analysis tasks on a cluster with the support of Spark. In this talk, we will present our progress in improving SparkR's functionality and performance, along with one real-world application of SparkR: modeling data centers.

SparkR's performance is limited by two factors. The first is slow data transmission between Spark and the R process. SparkR pushes the data stream from the Spark executor into a GNU R process through pipes, which adds overhead in both data transmission and serialization. Moreover, extra data copies increase the memory footprint on the Spark worker nodes, so users may hit memory allocation failures when they try to persist more datasets in the memory cache. We tweaked and integrated Renjin, a Java-based R interpreter, with the Spark worker executors to allow multiple R session threads to share the dataset inside the Spark process. This removes the data transmission, reduces the memory footprint, and makes serialization about 10x faster. If a third-party package used by the application is not supported by Renjin, we fall back to the original data-streaming worker with the GNU R interpreter.

The second factor is slow interpretation in the R process. Although Spark is well designed and implemented to schedule and run the whole cluster job, each single task run in an R instance is quite slow due to the interpreted nature of the R virtual machine. The main computation in a single R instance is invoking R's apply class of operations. R executes these operations by looping over the data in the interpreter, which incurs a huge interpretation overhead. However, the map-style operations of the apply class are intrinsically a kind of array operation and can be vectorized.
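As an illustration of the transformation (a minimal sketch in plain Python with hypothetical data, not the SparkR internals or R code), looping-over-data applies one interpreted call per record, while the vectorized form first permutes rows into columns and then makes one call per column:

```python
# Illustrative sketch only: transforming a row-at-a-time apply into
# column-wise vector calls. Data and function names are hypothetical.

records = [{"cpu": 2, "mem": 5}, {"cpu": 7, "mem": 1}, {"cpu": 4, "mem": 9}]

# Looping-over-data: one interpreted function call per record.
def scale_row(r):
    return {"cpu": r["cpu"] * 10, "mem": r["mem"] * 10}

looped = [scale_row(r) for r in records]

# Data permutation: reorganize rows into columns (struct-of-arrays layout).
cols = {k: [r[k] for r in records] for k in records[0]}

# Operation vectorization: one function invocation per column, not per element.
def scale_col(xs):
    return [x * 10 for x in xs]

vec_cols = {k: scale_col(v) for k, v in cols.items()}

# Permute back to rows; both paths produce the same result.
keys = list(vec_cols)
vectorized = [dict(zip(keys, vals)) for vals in zip(*(vec_cols[k] for k in keys))]
assert looped == vectorized
```

The interpretation overhead scales with the number of function invocations, so replacing per-element calls with per-column calls is where the savings come from.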
We combined operation vectorization and data permutation to transform the looping-over-data execution into a sequence of vector function invocations, which dramatically reduces the interpretation overhead and improves performance by up to 20x in each single R instance. As a result, the whole SparkR system runs much faster without any modifications to the original application code. SparkR gives us the flexibility of modeling in R while letting us leverage Spark for high performance.

A real-world application of SparkR is virtual machine consolidation in cloud-based data centers. Efficient scheduling of a data center's VMs can reduce the number of physical servers needed and, in turn, the energy and other capital costs of maintaining the virtualized data center. We use SparkR to model VM workloads as time series, extracting both low- and high-frequency features of the workload, and adopt a data-driven approach to achieve efficient proactive VM scheduling. This work is a collaboration with Shuo Yang and Peng Wu from Huawei America Lab.
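To sketch the modeling idea (again in plain Python with synthetic samples; the actual pipeline runs as R code on SparkR, and the window size and data here are illustrative assumptions), a simple sliding-window mean separates a workload trace into a low-frequency trend and a high-frequency residual:

```python
# Sketch: split a synthetic VM load trace into a low-frequency component
# (trend) and a high-frequency component (spikes) with a trailing mean.
# Data and window width are illustrative, not from the real workload.

def moving_average(xs, w):
    """Mean over a trailing window of width w (shorter at the start)."""
    return [sum(xs[max(0, i - w + 1): i + 1]) / (i - max(0, i - w + 1) + 1)
            for i in range(len(xs))]

load = [10, 12, 11, 50, 13, 12, 14, 55, 12, 11]  # synthetic CPU-load samples
low = moving_average(load, 3)                     # low-frequency trend
high = [x - l for x, l in zip(load, low)]         # high-frequency residual
```

The low-frequency component captures the sustained demand a VM placement must accommodate, while the high-frequency residual characterizes transient spikes; a scheduler can treat the two differently when consolidating VMs onto fewer physical servers.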
Hao Lin is a Ph.D. student in the School of Electrical and Computer Engineering at Purdue University, West Lafayette, under the supervision of Prof. Samuel Midkiff. His research interests include parallel data systems and cloud computing. Hao Lin has also worked as a software engineering intern on data infrastructure at Huawei Technologies and Google Inc.
Haichuan Wang is a Ph.D. student in the Computer Science department at the University of Illinois at Urbana-Champaign, working with Professor David Padua. Wang's research areas include compilers, runtimes, and parallel computing. He is currently working on compiler and runtime optimization for dynamic scripting languages. Before that, Wang was a Research Staff Member at IBM Research – China, where he researched parallel programming models and performance tooling for the Java language.