Datasets are growing exponentially, but the need for quick, interactive responses to queries remains as important as ever. This talk will give an overview of the components of BlinkDB and introduce a new generalized online aggregation (G-OLA) paradigm in SparkSQL that incrementally processes massive amounts of data on clusters of tens, hundreds, or thousands of machines while returning approximate answers. More precisely, this new execution model enables SparkSQL to present the user with meaningful approximate results (with error bars) that are continuously refined and updated, at a speed comfortable to the user, while it processes larger and larger fractions of the whole dataset in the background. This not only alleviates the need to pre-process the data for a wide range of queries, but also lets users observe the progress of a query and control its execution on the fly, enabling a smooth time/accuracy trade-off.
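To make the idea of continuously refined answers with error bars concrete, here is a minimal single-machine sketch in plain Python. It is not the G-OLA or BlinkDB implementation; the names `online_mean` and `make_batches` are hypothetical, and the error bar is a simple normal-approximation confidence interval that assumes rows arrive in random order.

```python
import math
import random


def online_mean(batches, z=1.96):
    """Yield (rows_seen, running_mean, half_width) after each batch.

    half_width is z standard errors of the running mean, i.e. roughly a
    95% confidence interval for z=1.96. This assumes the rows seen so far
    are a random sample of the whole dataset.
    """
    n = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations (Welford's method)
    for batch in batches:
        for x in batch:
            n += 1
            delta = x - mean
            mean += delta / n
            m2 += delta * (x - mean)
        if n > 1:
            std_err = math.sqrt(m2 / (n - 1) / n)
        else:
            std_err = float("inf")  # no variance estimate from one row
        yield n, mean, z * std_err


def make_batches(values, batch_size):
    """Split a list into consecutive mini-batches."""
    for start in range(0, len(values), batch_size):
        yield values[start:start + batch_size]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic data standing in for a large table column.
    data = [rng.gauss(100.0, 15.0) for _ in range(100_000)]
    for n, mean, half_width in online_mean(make_batches(data, 10_000)):
        print(f"rows={n:>7}  mean ~ {mean:.2f} +/- {half_width:.2f}")
        # The user controls the time/accuracy trade-off by stopping early.
        if half_width < 0.15:
            break
```

The caller sees an estimate after every batch and can stop as soon as the interval is tight enough, which is the time/accuracy trade-off described above in miniature.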
Sameer Agarwal is a Software Engineer at Databricks working at the intersection of large-scale distributed systems, databases, and statistics. He received his PhD in Databases from the UC Berkeley AMPLab, where he led the research, design, and development of BlinkDB (http://blinkdb.org), an open-source, massively parallel approximate query processing framework. He received his B.Tech in Computer Science and Engineering from the Indian Institute of Technology Guwahati, where he was awarded the President of India Gold Medal in 2009. He was a Qualcomm Innovation Fellow in 2012-13 and a Facebook Graduate Fellow in 2013-14.
Kai Zeng is a postdoctoral researcher at the UC Berkeley AMPLab. His research interests lie in large-scale, data-intensive systems. He received his PhD in Databases from UCLA in 2014. Before joining UCLA, he received his Bachelor’s degree in Software Engineering from Zhejiang University, China, in 2009. He has won several awards, including the SIGMOD 2012 Best Paper Award and the SIGMOD 2014 Best Demo Award.