The DataFrame API of Spark SQL makes it easy to integrate external sources such as SQL databases, CSV files, or Avro sources. In addition, Spark uses the computational capabilities of these sources to push down projections and filters to the data source. This prunes unnecessary data directly in the source and reduces evaluation time at the Spark level. However, sources such as SQL engines can also evaluate more complex parts of the logical plan. Exploiting these capabilities promises a substantial performance boost: for example, evaluating aggregates or joins directly in the source dramatically reduces the amount of data copied. This is challenging because it requires rewriting the logical plan depending on the features used and the partitioning of the data. In this talk we present an extension of the data source API that allows the pushdown of arbitrary elements of the logical plan. As part of this extension, sources can announce their capabilities, provided the underlying system supports them. We implemented the extended data source API on top of HANA as well as a newly developed lightweight in-memory processing engine from SAP. We show that the extension improves the performance of Spark SQL in combination with HANA and the lightweight engine, and we give insights into how the functionality can be used for arbitrary data sources.
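The capability-based pushdown described above can be illustrated with a small, self-contained Python sketch. It is not the extended data source API from the talk and does not use Spark. The names `ToySource`, `supports`, `scan`, `sum_in_source`, and `plan_sum_by` are hypothetical and exist only for illustration. The sketch assumes a source that announces the plan operations it can evaluate, and a planner that pushes an aggregate into the source when that capability is announced and otherwise falls back to a filtered scan.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Set

Row = Dict[str, int]
Predicate = Optional[Callable[[Row], bool]]


def _sum_groups(rows: List[Row], key: str, value: str) -> List[Row]:
    """Group rows by `key` and sum `value` per group."""
    totals: Dict[int, int] = {}
    for r in rows:
        totals[r[key]] = totals.get(r[key], 0) + r[value]
    return [{key: k, "sum": v} for k, v in totals.items()]


@dataclass
class ToySource:
    """Hypothetical source that announces which plan operations it can evaluate."""
    rows: List[Row]
    capabilities: Set[str] = field(default_factory=lambda: {"project", "filter"})
    rows_shipped: int = 0  # number of rows copied out of the source so far

    def supports(self, op: str) -> bool:
        return op in self.capabilities

    def scan(self, predicate: Predicate = None) -> List[Row]:
        # Filter pushdown: only matching rows leave the source.
        out = [r for r in self.rows if predicate is None or predicate(r)]
        self.rows_shipped += len(out)
        return out

    def sum_in_source(self, key: str, value: str, predicate: Predicate = None) -> List[Row]:
        # Aggregate pushdown: only one row per group leaves the source.
        matching = [r for r in self.rows if predicate is None or predicate(r)]
        out = _sum_groups(matching, key, value)
        self.rows_shipped += len(out)
        return out


def plan_sum_by(source: ToySource, key: str, value: str, predicate: Predicate = None) -> List[Row]:
    """Toy planner: push the aggregate down only if the source announces support."""
    if source.supports("aggregate"):
        return source.sum_in_source(key, value, predicate)
    if source.supports("filter"):
        rows = source.scan(predicate)
    else:
        # No filter support: fetch everything and filter on the engine side.
        rows = [r for r in source.scan() if predicate is None or predicate(r)]
    return _sum_groups(rows, key, value)
```

Running `plan_sum_by` against two `ToySource` instances holding the same rows, one of which additionally announces `"aggregate"`, yields the same result, while the `rows_shipped` counter shows how many fewer rows leave the source when the aggregate is pushed down.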
Stephan Kessler is a developer in a research and development team at SAP Walldorf. He works on the integration of SAP's query execution engines into the Spark ecosystem. His main goals are to further improve the speed of Spark processing and to bring new features to the SQL extension. Before joining SAP, he completed his PhD and his Diploma (M.Sc.) at the Karlsruhe Institute of Technology, at the chair of database and information systems. Before joining the Big Data community, his research interests covered privacy in databases as well as sensor networks.
Santiago Mola is a Big Data Developer at Stratio. He works on projects with Apache Spark Streaming and SQL and is currently helping to build the integration of Apache Spark with SAP HANA Vora. Santiago previously worked as a researcher in the machine learning field and has contributed to open source projects for 9 years.