Apache Spark is an active member of the broad open source community beyond the Apache Software Foundation. Every day, thousands of users combine the capabilities of Spark with other open source software to get their jobs done. This is not by chance: Spark has been designed to work well with existing ecosystems. For example, PySpark is designed to work well with pandas, NumPy, and other Python packages. In this talk we will present an analysis of the libraries and open source tools that are commonly used alongside Spark in the JVM, Python, and R ecosystems. Our quantitative results are based on the usage of thousands of Spark users. We will show Spark Summit attendees what the rest of their community finds useful to complement the power of Spark, and which parts of the Spark API are used in conjunction with the most popular open source libraries.
Hossein Falaki is a software engineer and data scientist at Databricks, working on the next big thing. Prior to that, he was a data scientist on Siri, Apple’s personal assistant. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing (CENS).