This session demonstrates a novel visualization technique for quickly understanding the character of big datasets and for formulating the questions that deeper investigation requires. By plotting an entire dataset in near real time, the technique lets users readily identify the scope of their data, spot patterns, and discover issues with data quality and collection. Users can interact with their data at multiple resolutions, from high-level overviews down to individual data points. High-resolution data visualization harnesses the power of human perception to rapidly identify patterns and form new hypotheses, and because the entire dataset is plotted, it avoids problems associated with data sampling, such as obscured patterns and outliers.
This presentation will explain how we use Apache Spark to generate multi-resolution data plots in real time. Our tile-based approach to visualization, adapted from web maps, partitions big data into a pyramid of fixed-size tiles, which allows efficient zooming and panning around a plot within a web browser. Unlike static maps, our approach keeps the apportionment of data to tiles separate from the visual rendering process. This allows the tile visualization to be altered in real time, enabling interactions such as filtering data values and other visualization techniques. Using Apache Spark, we implemented a single-pass binning method that projects data to requested tiles on demand. To optimize throughput, we reduce shuffles by using Spark accumulators to bin and summarize the data within each tile.
The real-time, exploratory nature of this approach requires on-demand access to cluster computing resources for users of the application. Databricks’ approachable, managed platform therefore allows us to create a compelling and user-friendly application that scales easily with users and their data.
To illustrate the power of this technique, we will demonstrate how our approach can be used to interactively explore and make sense of massive geographic, social media and financial datasets.
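
As a rough illustration of the tile-pyramid binning described in the abstract above, the following is a minimal sketch in plain Python rather than Spark. It is not the presenters' implementation: the names `tile_and_bin`, `bin_points`, `merge_tiles`, and `render`, the `TILE_BINS` constant, and the `(level, tile_x, tile_y)` key layout are all assumptions made for this example. `merge_tiles` stands in for the kind of additive combination an accumulator performs, and `render` shows how a display-time filter can be applied without re-binning.

```python
from collections import defaultdict

# Bins per tile side. Hypothetical value chosen to keep the example small.
TILE_BINS = 4


def tile_and_bin(x, y, level, bounds):
    """Map a point to ((level, tile_x, tile_y), (bin_x, bin_y)), or None if out of bounds."""
    min_x, min_y, max_x, max_y = bounds
    if not (min_x <= x <= max_x and min_y <= y <= max_y):
        return None
    tiles_per_side = 2 ** level  # each pyramid level doubles the tiles per side
    bins_per_side = tiles_per_side * TILE_BINS
    # Global bin index, clamped so points on the max edge land in the last bin.
    gx = min(int((x - min_x) / (max_x - min_x) * bins_per_side), bins_per_side - 1)
    gy = min(int((y - min_y) / (max_y - min_y) * bins_per_side), bins_per_side - 1)
    return (level, gx // TILE_BINS, gy // TILE_BINS), (gx % TILE_BINS, gy % TILE_BINS)


def bin_points(points, levels, bounds):
    """Single pass over the points, counting into every requested level."""
    tiles = defaultdict(lambda: [[0] * TILE_BINS for _ in range(TILE_BINS)])
    for x, y in points:
        for level in levels:
            located = tile_and_bin(x, y, level, bounds)
            if located is None:
                continue
            key, (bx, by) = located
            tiles[key][by][bx] += 1
    return dict(tiles)


def merge_tiles(a, b):
    """Add the bin counts of two partial results (an accumulator-style merge)."""
    merged = {key: [row[:] for row in grid] for key, grid in a.items()}
    for key, grid in b.items():
        if key not in merged:
            merged[key] = [row[:] for row in grid]
        else:
            for r in range(TILE_BINS):
                for c in range(TILE_BINS):
                    merged[key][r][c] += grid[r][c]
    return merged


def render(grid, min_count=1):
    """Display-time filter: hide bins below a threshold without re-binning."""
    return ["".join("#" if v >= min_count else "." for v in row) for row in grid]
```

Because only counts are stored per bin, partial results computed on separate chunks of the data can be merged in any order, and the rendering threshold can change without touching the binned tiles.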
Rob Harper is Partner, Lead Product Architect at Uncharted Software Inc., and has been building technical platforms and products in the visualization industry for a decade. Over the past several years, Rob has focused on developing web-based technology approaches for big data.
Nathan Kronenfeld is a software designer and amateur mathematician, currently living in the Toronto area. He has been working with Spark for almost three years.