Expanding Apache Spark Use Cases in 2.2 and Beyond

2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order-of-magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.

Matei Zaharia, Co-founder and Chief Technologist at Databricks

About Matei

Matei Zaharia is an assistant professor of computer science at Stanford and Chief Technologist of Databricks. He started the Spark project at UC Berkeley and continues to serve as its vice president at Apache. Matei also co-started the Apache Mesos project and is a committer on Apache Hadoop. Matei’s research work on datacenter systems was recognized through two Best Paper awards and the 2014 ACM Doctoral Dissertation Award.

Michael Armbrust, Software Engineer at Databricks

About Michael

Michael Armbrust is the lead developer of the Spark SQL project at Databricks. He received his PhD from UC Berkeley in 2013, and was advised by Michael Franklin, David Patterson, and Armando Fox. His thesis focused on building systems that allow developers to rapidly build scalable interactive applications, and specifically defined the notion of scale independence. His interests broadly include distributed systems, large-scale structured storage and query optimization. He was the 2011 recipient of the Sevin Rosen Award for Innovation.

Tim Hunter, Software Engineer at Databricks

About Tim

Tim Hunter is a software engineer at Databricks and contributes to the Apache Spark MLlib project. He has been building distributed machine learning systems with Spark since version 0.2, before Spark was an Apache Software Foundation project.