Spark Summit 2014 brought the Apache Spark community together on June 30 to July 2, 2014, at the Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
In this talk we will present the underlying abstraction called Distributed DataFrame (DDF) that powers the rapid construction of applications like Adatao pInsights and pAnalytics, directly on top of Spark RDDs. This has enabled Adatao to provide easy interfaces such as Natural Language, R, and Python into the underlying Spark/Shark engine.
DDF’s goal is to make the Big-Data API as simple and accessible to scientists and engineers as the equivalent “small-data” RDBMS API. The core idea behind DDF is to combine decades of wisdom in (a) RDBMS, (b) R Data Science, and (c) Distributed Computing. It gives the API user a simple yet rich set of idioms, such as friendly SQL queries, easy data table filtering and projections, transparent handling of missing data, and quick access to machine-learning algorithms, while still allowing direct access to the underlying Spark RDD representation as needed.
For users, the main benefit of DDFs is productivity: many of the well-established idioms of RDBMS and data science are accessible within one or two lines of code, which speeds up analytic application development.
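To make the idioms above concrete, here is a minimal sketch in plain Python. It is not the DDF API: `ToyTable` and all of its methods (`project`, `filter`, `fill_missing`, `column_mean`, `underlying`) are hypothetical names invented for illustration, and the data lives in an in-memory list rather than a Spark RDD. It only shows how projection, filtering, missing-data handling, and access to the raw representation might each fit in a line or two.

```python
from statistics import mean


class ToyTable:
    """Hypothetical in-memory stand-in for a DDF-like table (not the real DDF API)."""

    def __init__(self, columns, rows):
        self.columns = list(columns)
        self.rows = [tuple(r) for r in rows]

    def project(self, *names):
        # Keep only the named columns, in the order given.
        idx = [self.columns.index(n) for n in names]
        return ToyTable(names, [tuple(r[i] for i in idx) for r in self.rows])

    def filter(self, predicate):
        # The predicate receives each row as a {column: value} dict.
        keep = [r for r in self.rows if predicate(dict(zip(self.columns, r)))]
        return ToyTable(self.columns, keep)

    def fill_missing(self, name, value):
        # Replace None in one column with a default value.
        i = self.columns.index(name)
        rows = [r[:i] + (value if r[i] is None else r[i],) + r[i + 1:]
                for r in self.rows]
        return ToyTable(self.columns, rows)

    def column_mean(self, name):
        # Missing values (None) are skipped rather than raising an error.
        i = self.columns.index(name)
        return mean(r[i] for r in self.rows if r[i] is not None)

    def underlying(self):
        # Escape hatch: hand back the raw representation (here, a list of tuples).
        return self.rows


flights = ToyTable(["carrier", "delay"], [("AA", 10), ("UA", None), ("AA", 30)])
aa_delays = flights.filter(lambda row: row["carrier"] == "AA").project("delay")
filled = flights.fill_missing("delay", 0)
avg_delay = flights.column_mean("delay")
```

In this toy version, `underlying()` plays the role that direct access to the Spark RDD plays in the abstract: the convenience methods do not hide the raw data from users who need it.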
DDF’s architecture is componentized and pluggable by design, even at run-time, making it easy for users to replace or extend any component (“handler”) at will without having to modify the API or ask for permission.
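The pluggable-handler idea can be sketched as follows. This is an illustration under stated assumptions, not DDF's actual design: `ToyFrame`, `set_handler`, and the two handler classes are hypothetical names. The point is only that the public method (`summary`) stays the same while the component behind it is swapped at run-time.

```python
class MeanSummaryHandler:
    """Hypothetical default handler: average of the non-missing values."""

    def summarize(self, values):
        present = [v for v in values if v is not None]
        return sum(present) / len(present)


class MaxSummaryHandler:
    """Hypothetical replacement handler: maximum of the non-missing values."""

    def summarize(self, values):
        return max(v for v in values if v is not None)


class ToyFrame:
    """Hypothetical frame whose behavior is delegated to replaceable handlers."""

    def __init__(self, values):
        self.values = list(values)
        self._handlers = {"summary": MeanSummaryHandler()}

    def set_handler(self, role, handler):
        # Swap a component at run-time; the public API is unchanged.
        self._handlers[role] = handler

    def summary(self):
        # The frame delegates to whichever handler is currently installed.
        return self._handlers["summary"].summarize(self.values)


frame = ToyFrame([1, None, 5])
before = frame.summary()                            # uses MeanSummaryHandler
frame.set_handler("summary", MaxSummaryHandler())
after = frame.summary()                             # same call, new handler
```

Callers of `summary()` need no changes when the handler is replaced, which is the property the abstract describes as extending a component without modifying the API.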
Christopher is co-founder & CEO of Adatao. He is a former engineering director of Google Apps, a recipient of the Google Founders’ Award, and a Stanford PhD who co-founded two enterprise startups with successful exits. He was also a professor and co-founder of the Computer Engineering program at HKUST. He graduated from U.C. Berkeley summa cum laude. Christopher has extensive experience building technology companies that solve enterprise business challenges.