At the end of the day, the only thing that data scientists want is tabular data for their analysis. They do not want to spend hours or days preparing data. How does a data engineer handle the massive amount of data being streamed from IoT devices and apps, and at the same time add structure to it so that data scientists can focus on finding insights rather than preparing data? By the way, you need to do this within minutes (sometimes seconds). Oh… and there are many other data sources that you need to ingest, and the current data providers keep changing their structure.
GoPro has massive amounts of heterogeneous data streaming in from its consumer devices and applications, and has developed the concept of “dynamic DDL” to structure that streamed data on the fly using Spark Streaming, Kafka, HBase, Hive and S3. The idea is simple: add structure (schema) to the data as soon as possible; allow the providers of the data to dictate the structure; and automatically create event-based and state-based tables (DDL) for all data sources so that data scientists can access the data via their lingua franca, SQL, within minutes.
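To make the "dynamic DDL" idea concrete, here is a minimal sketch in Python of one way it could work: look at an incoming JSON event, infer a column type for each field, and emit the Hive DDL needed to create the table or add newly seen columns. This is not GoPro's implementation, which the abstract does not describe. The function names, the type mapping, the Parquet storage format, and the table and bucket names are all assumptions made for illustration.

```python
import json

# Hypothetical mapping from JSON value types to Hive column types.
HIVE_TYPES = {bool: "BOOLEAN", int: "BIGINT", float: "DOUBLE", str: "STRING"}


def infer_columns(event_json):
    """Return {column_name: hive_type} for the top-level fields of one JSON event."""
    event = json.loads(event_json)
    columns = {}
    for name, value in event.items():
        # Nested objects, arrays, and nulls fall back to STRING in this sketch.
        columns[name.lower()] = HIVE_TYPES.get(type(value), "STRING")
    return columns


def ddl_for(table, known_columns, event_json, location):
    """Return the Hive DDL that makes `table` fit the event, or None if it already fits.

    `known_columns` is the set (or dict) of column names the table already has;
    in a real pipeline this would come from the metastore (an assumption here).
    """
    incoming = infer_columns(event_json)
    new = {c: t for c, t in incoming.items() if c not in known_columns}
    if not new:
        return None  # the provider's structure is unchanged
    cols = ", ".join(f"`{c}` {t}" for c, t in sorted(new.items()))
    if not known_columns:
        # First event from this source: create the table over its storage location.
        return (f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} ({cols}) "
                f"STORED AS PARQUET LOCATION '{location}'")
    # The provider added fields: evolve the existing table.
    return f"ALTER TABLE {table} ADD COLUMNS ({cols})"
```

Calling `ddl_for("camera_events", {}, event, "s3://example-bucket/camera_events/")` on the first event from a source yields a `CREATE EXTERNAL TABLE` statement; calling it again with the known columns and an event that carries an extra field yields an `ALTER TABLE ... ADD COLUMNS` statement. The provider's payload drives the schema, so the table follows the data without manual intervention.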
Session hashtag: #SFdev6
David is an Architect in the Data Science and Engineering team at GoPro and the creator of their Spark-Kafka data ingestion pipeline. He has been developing scalable data processing pipelines and eCommerce systems for over 20 years in Silicon Valley. David’s current big data interests include streaming data as fast as possible from devices to near real-time dashboards and switching his primary programming language to Scala from Java after nearly 20 years. He holds a B.Sc. in Computer Science from The Ohio State University.
Hao joined the Data Science and Engineering team at GoPro in 2016 and immediately started cranking out Java and Scala code for use in both the Spark Streaming and batch data pipelines. Hao continuously supports the data publishing needs of the device and software application development teams at GoPro and helps them choose the most appropriate and efficient ways to stream and store their data. He has an M.Sc. in Computer Science from Northeastern University.