Visualizing large data is challenging: there are more data points than available pixels, and manipulating distributed data can take a long time. One solution is to build custom rendering engines, but we believe that with Spark we can apply off-the-shelf, open-source visualization tools such as D3, Matplotlib, and ggplot to very large data. This approach has several benefits. First, data scientists are already familiar with these tools. Second, their output can be readily shared with others on the web. Finally, separating data manipulation from rendering lets users freely choose the best tool for the job; for example, if a graph needs to be interactive, D3 is a better choice than Matplotlib. Apache Spark is well suited to this task: it enables interactive analysis of big data by using caching to bring query latency down to the range of human interaction, and its unified programming model and diverse language interfaces integrate smoothly with popular visualization tools. Together, these let us perform both exploratory and expository visualization over large data. In this talk we will introduce the relevant Spark APIs for sampling and manipulating large data, and demonstrate how they can be integrated with D3 and Matplotlib for end-to-end data visualization.
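One pattern the abstract alludes to is sampling a large distributed dataset down to a size a local plotting library can render. A minimal sketch of that pattern with PySpark and Matplotlib might look like the following; the column name `latency_ms`, the sampling fraction, and the function name are illustrative assumptions, not details from the talk:

```python
def plot_sampled_histogram(df, column="latency_ms", fraction=0.001, seed=42):
    """Sample a large Spark DataFrame down to local-memory size,
    then hand the result to Matplotlib for rendering.

    `df` is assumed to be a pyspark.sql.DataFrame; the column name,
    fraction, and seed are illustrative defaults.
    """
    # Imports are deferred so the sketch can be defined without a
    # Spark or Matplotlib installation present.
    import matplotlib.pyplot as plt

    # DataFrame.sample draws a Bernoulli sample of roughly
    # fraction * row-count rows without collecting the full dataset.
    sampled = df.sample(withReplacement=False, fraction=fraction, seed=seed)

    # toPandas() brings the (now small) sample to the driver as a
    # pandas DataFrame that any local tool can consume.
    pdf = sampled.toPandas()

    fig, ax = plt.subplots()
    ax.hist(pdf[column], bins=50)
    ax.set_xlabel(column)
    ax.set_ylabel("count")
    return fig
```

Note that rendering stays entirely on the driver in Matplotlib; swapping in D3 would mean exporting `pdf` as JSON for the browser instead of plotting locally, while the Spark-side sampling code stays the same.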
Hossein Falaki is a software engineer at Databricks working on the next big thing. Prior to that, he was a data scientist working on Apple's personal assistant, Siri. He graduated with a Ph.D. in Computer Science from UCLA, where he was a member of the Center for Embedded Networked Sensing.