We can compose data structures like LEGO bricks: a relational database table can be modeled as a `Vector[Employee]`, where we compose the `Vector` collection with the `Employee` class corresponding to a table row. What we don’t see is the resulting memory layout, which is neither compact nor efficient: accessing any value requires dereferencing at least three pointers and each `Employee` object requires its own header.
Data-centric metaprogramming is a technique that lets developers tweak the memory layout of their data structures while keeping the same high-level interface. For example, we can devise a transformation that stores a `Vector[Employee]` as separate arrays (or vectors), one per field, improving both execution performance and heap footprint. The compiler then applies this transformation to our code, so that it operates on the improved memory layout for `Vector[Employee]`. Thus, there is no need for premature optimization: we write the code using our favorite abstractions and improve the memory layout and operations after the fact, only when necessary.
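To make the layout change concrete, here is a minimal hand-written sketch of the column-per-field idea. It is not the plugin's API and not its generated code: the fields of `Employee` and the `EmployeeColumns` class with its `from` and `totalSalary` members are hypothetical names chosen for illustration. It only shows what a transformation of this kind would aim to produce while keeping an `apply`/`length` interface similar to that of `Vector[Employee]`.

```scala
// Hypothetical row type; the field names are assumptions for illustration.
final case class Employee(id: Int, name: String, salary: Double)

// Hand-written columnar layout: one array per field instead of one
// heap object per row. Hypothetical class, not part of any library.
final class EmployeeColumns private (
    ids: Array[Int],
    names: Array[String],
    salaries: Array[Double]
) {
  def length: Int = ids.length

  // Rebuilds a row on demand, so callers still see Employee values.
  def apply(i: Int): Employee = Employee(ids(i), names(i), salaries(i))

  // Scans a single primitive array without touching any Employee object.
  def totalSalary: Double = {
    var sum = 0.0
    var i = 0
    while (i < salaries.length) {
      sum += salaries(i)
      i += 1
    }
    sum
  }
}

object EmployeeColumns {
  // Converts the row-oriented collection into the columnar layout.
  def from(rows: Seq[Employee]): EmployeeColumns =
    new EmployeeColumns(
      rows.map(_.id).toArray,
      rows.map(_.name).toArray,
      rows.map(_.salary).toArray
    )
}

object Demo {
  def main(args: Array[String]): Unit = {
    val rows = Vector(Employee(1, "Ann", 100.0), Employee(2, "Bob", 200.0))
    val cols = EmployeeColumns.from(rows)
    println(cols(1))          // the second row, rebuilt from the columns
    println(cols.totalSalary) // sum over the salary column only
  }
}
```

Writing and maintaining such a class by hand for every collection is exactly the premature optimization the technique tries to avoid; the sketch is only meant to show the target layout.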
In this talk we will look at a prototype Scala compiler plugin that enables data-centric metaprogramming and go through a few examples. I will then show the challenges of applying this technique to Spark to improve its execution performance.
The compiler plugin is developed at github.com/miniboxing/ildl-plugin and the documentation is available at scala-ildl.org.
Vlad is a PhD student in the Scala Team at EPFL, where he’s working on optimizing high-level patterns in the Scala programming language down to fast JVM bytecode. His main project, miniboxing (scala-miniboxing.org), is aimed at compiling generic classes down to very efficient bytecode. Vlad also contributed to the Scala compiler in the areas of specialization and the JVM backend, and to the scaladoc tool, where you may have seen the diagrams and the implicit member listings he developed.