ggplot2 is one of the most popular data visualization packages for R, making it easy to produce high-quality graphs from data held in an R data.frame. However, ggplot2 is not well suited to big data visualization, for three reasons. First, the maximum data size it can handle is limited by physical memory, since the R Virtual Machine (RVM) attempts to keep the entire data.frame in memory. Second, even if the data set fits in memory, importing it from a file often takes a long time, partly due to the overhead of format conversion. Finally, ggplot2 does not make effective use of the abundant computing resources offered by today’s parallel/distributed machines, as the package itself is not parallelized.

In this presentation, we introduce ggplot2.SparkR, an R package for scalable visualization of big data represented as a Spark DataFrame. ggplot2.SparkR is an extension to the original ggplot2 package and seamlessly handles both R data.frame and Spark DataFrame inputs with no modifications to the original API. When invoked, a plot function in ggplot2.SparkR first checks the type of its input data. If the input is a Spark DataFrame, the heavyweight data processing stages are offloaded to the Spark backend through the SparkR API, and the final results are collected and coerced into an R data.frame. Otherwise, the input data goes through the original ggplot2 data processing path on the RVM. In both cases, a common backend plotting stage then draws the graph, preserving the same look and feel. ggplot2.SparkR requires no additional training for R users who are already familiar with ggplot2, and allows them to benefit from the distributed processing capabilities of Spark for efficient visualization of big data. To demonstrate this, we plan to present a demo with a detailed comparison of ggplot2 and ggplot2.SparkR graphics.
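The type-based dispatch described above might look roughly like the following R sketch. It is an illustration under stated assumptions, not the ggplot2.SparkR source: `count_by` and `plot_counts` are hypothetical helpers introduced here, and the class name `SparkDataFrame` assumes a recent SparkR release (older releases named the class `DataFrame`). Only the small aggregated result is brought back to R before the shared plotting stage runs.

```r
# Hypothetical helper (not part of ggplot2.SparkR): count rows per category,
# on Spark when given a Spark DataFrame, locally otherwise.
count_by <- function(data, column) {
  if (inherits(data, "SparkDataFrame")) {
    # Spark path (assumes SparkR is installed and a session is running):
    # aggregate on the cluster, then collect only the per-category counts.
    grouped <- SparkR::groupBy(data, column)
    counts <- SparkR::collect(SparkR::count(grouped))
  } else {
    # Local path: ordinary in-memory aggregation on an R data.frame.
    counts <- as.data.frame(table(data[[column]]), stringsAsFactors = FALSE)
  }
  # Both paths end with the same two-column R data.frame.
  names(counts) <- c("category", "count")
  counts
}

# Hypothetical helper: common plotting stage shared by both paths.
plot_counts <- function(data, column) {
  counts <- count_by(data, column)
  ggplot2::ggplot(counts, ggplot2::aes(x = category, y = count)) +
    ggplot2::geom_bar(stat = "identity")
}
```

With ggplot2 installed, `plot_counts(mtcars, "cyl")` exercises the local path. Passing a Spark DataFrame instead would take the Spark branch, which assumes an active SparkR session.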
His research areas include computer architecture and compilers, parallel programming, and computer security, and he has co-authored more than 30 research papers in leading CS conferences and journals, with over 2,500 citations. His work has been recognized with various awards, including the ACM ASPLOS Most Influential Paper Award (2014), the HiPEAC Paper Award (2012), and IEEE PACT Top Paper (2010). Before joining SKKU, he was a research associate at Princeton University, where he conducted research on multicore software optimization. He received his M.S. degree in Electrical Engineering from Stanford University and his Ph.D. degree in Computer Science from MIT.
Sangoh Jeong works for SK Telecom (SKT) in Korea, where he’s in charge of a SparkR project and is involved in other projects related to Operational Intelligence for Cloud systems. He received his Ph.D. in Electrical Engineering from Stanford University in 2006. Prior to joining SKT, he worked for Samsung Information Systems America, HP Labs, LookSmart, and Ricoh Innovations in California, and for LG Electronics and Samsung Electronics in Korea. His research interests include machine learning, Big Data analytics, IoT, and computer vision. He’s also interested in Spark MLlib.
Before coming to SKKU, she received her B.S. degree in Electronic and Communication Engineering from Yanbian University of Science & Technology, China, in 2013. After graduating, she worked as a member of the teaching staff at the same university for two years. Her research interests include cloud computing and big data.