Since its introduction one year ago, Spark SQL has proven to be a highly effective way to speed up existing SQL workloads by leveraging the power of Spark. Spark SQL’s built-in support for reading data from existing Hive warehouses allows HQL users to achieve better performance simply by switching query engines. However even non-SQL workloads can often benefit from the automatic optimizations that Spark SQL can perform. At the core of Spark SQL is the notion of a SchemaRDD, which improves on traditional RDDs by giving them knowledge of how best to manipulate the data that they hold. In addition to rich querying, this structure makes it possible to more efficiently cache and shuffle the data during computations. Furthermore, with the addition of the data sources API, Spark SQL makes it easier to compute over structured data sourced from a wide variety of formats, including Parquet, JSON, Apache Avro, and more. This talk will show examples of how even traditional Spark jobs can benefit from using SchemaRDDs to capture richer information about the structure of the data being processed. It will also describe how the built-in JDBC server opens the loosely structured world of big data to traditional BI tools. Finally, it will reveal the road map for Spark SQL and where the project is heading.
Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization.