Demystifying DataFrame and Dataset

Slides PDF Video

Apache Spark achieves high performance with ease of programming due to a well-balanced design between ease of usage of APIs and the state-of-the-art runtime optimization. In Spark 1.3, DataFrame API was introduced to write a SQL-like program in a declarative manner. It can achieve superior performance by leveraging advantages in Project Tungsten. In Spark 1.6, Dataset API was introduced to write a generic program, such as machine learning in a functional manner. It was also designed to achieve superior performance by reusing the advantages in Project Tungsten. The differences between DataFrame and Dataset are not fully understood in the community, and it is worth understanding these differences because it is becoming popular to write programs in Dataset and for a transition of programs from RDD to Dataset.

This session will explore the differences between DataFrame and Dataset using programs that performs the same operations (e.g. filter()). Dr. Ishizaki will give several comparisons from levels of source code, SQL execution plans, SQL optimizations, generated Java code, data representations and runtime performance. He will show performance difference of the programs between DataFrame and Dataset, and will identify the cause of the difference. He will also explain opportunities and approaches to improve performance of Dataset programs by alleviating some of issues.

Learn to understand the differences between DataFrame and Dataset from several views; get to know performance differences of programs, which perform the same computation, by using the DataFrame API and the Dataset API; and understand opportunities to improve performance of programs in the Dataset API.

Session hashtag: #SFdev20

Dr. Kazuaki Ishizaki, Research Staff at IBM

About Dr.

Dr. Kazuaki Ishizaki is a research staff member at IBM Research – Tokyo. He has over 20 years of experience conducting research and development of dynamic compilers for Java and other languages. He is an expert in compiler optimizations, runtime systems, and parallel processing. He has been working for IBM Java just-in-time compiler and virtual machine from JDK 1.0 to Java 8. His research has focused on how system software can enable programmers to transparently exploit SIMD/GPUs in high-level languages and frameworks. He makes many contributions to Spark, especially in Catalyst and Tungsten. He is an ACM senior member.