Apache Spark performance on SQL and DataFrame/Dataset workloads has made impressive progress thanks to Catalyst and Tungsten, but there is still a significant gap compared with what best-of-breed query engines or hand-written low-level C code can achieve on modern server-class hardware. This session presents Flare, a new experimental back-end for Spark SQL that yields significant speedups by compiling Catalyst query plans to native code.
Flare’s low-level implementation takes full advantage of native execution, using techniques such as NUMA-aware scheduling and data layouts to leverage ‘mechanical sympathy’ and bring execution closer to the metal than current JVM-based techniques on big-memory machines. With available memory increasingly in the TB range, Flare makes scaling up on server-class hardware an interesting alternative to scaling out across a cluster, especially in terms of data center costs. This session will describe the design of Flare and demonstrate experiments on standard SQL benchmarks that exhibit order-of-magnitude speedups over Spark 2.1.
Session hashtag: #SFr10
Tiark Rompf is an Assistant Professor at Purdue University. His work focuses on advanced compiler technology for big data systems and associated language support. From 2008 to 2014 he was a member of the Scala team at EPFL, where he made various contributions to the Scala language and toolchain (delimited continuations, efficient immutable data structures, compiler speedups, type system work). From 2012 to 2014 he was a Principal Researcher at Oracle Labs. His work has been featured as a Research Highlight in CACM and has received a Best Paper Award at VLDB and an NSF CAREER Award.