Persisting data from Amazon Kinesis using Amazon Kinesis Firehose is a popular pattern for streaming projects. However, building real-time analytics on this data introduces challenges, including managing the format, size, and frequency of the files created.
This session will present an end-to-end use case for deploying machine learning streaming analytics at scale using Structured Streaming on Databricks. We will deploy a high-volume Kinesis producer, persist the data to S3 using Kinesis Firehose, partition and write the data as Parquet, create a machine learning model, and, finally, query and visualize the data in real time.
Key takeaways include:
– Create a Kinesis producer
– Persist data to S3 using Kinesis Firehose
– Perform ETL, machine learning, and exploratory data analysis using Structured Streaming
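To make the first takeaway concrete, below is a minimal sketch of what a Kinesis producer's records might look like. The stream name, device IDs, and payload fields are all hypothetical; the sketch builds JSON records with partition keys using only the standard library, and the actual send via boto3's `put_records` is shown commented out since it requires AWS credentials and a provisioned stream.

```python
import json
import random
import time

def build_record(device_id: str) -> dict:
    """Build one Kinesis record: a JSON payload plus a partition key.

    Records with the same partition key land on the same shard, so a
    per-device key spreads a high-volume stream across shards.
    """
    payload = {
        "device_id": device_id,
        "timestamp": time.time(),
        "reading": random.uniform(0.0, 100.0),  # stand-in for real sensor data
    }
    return {
        "Data": json.dumps(payload).encode("utf-8"),  # Kinesis expects bytes
        "PartitionKey": device_id,
    }

# A small batch cycling over four hypothetical devices
records = [build_record(f"device-{i % 4}") for i in range(10)]

# With AWS credentials configured, the batch would be sent with boto3:
# import boto3
# kinesis = boto3.client("kinesis")
# kinesis.put_records(StreamName="sensor-events", Records=records)  # hypothetical stream name

print(len(records), records[0]["PartitionKey"])
```

Downstream, Kinesis Firehose buffers these records and delivers them to S3, where Structured Streaming can read them for the ETL and analysis steps above.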
Session hashtag: #SFexp6
Caryl Yuhas is a Solutions Architect at Databricks, where she provides consultative and technical support for companies looking to optimize their data pipelines with Apache Spark and Databricks. Previously, as a Product Manager at MediaMath, Caryl worked on a solution for measuring the incremental return of advertisers’ digital media investments. It was during this project that she first began to work with distributed data processing and developed a passion for Spark and cloud computing. She is an alumna of the University of Pennsylvania, where she received a B.S.E. in Chemical and Biomolecular Engineering.
Myles Baker is a Solutions Architect who helps large enterprises develop Apache Spark applications using Databricks. His work on image processing software at NASA introduced him to distributed computing, and since then he has helped clients across multiple industries build data science models and applications at scale. He received a B.S. in Applied Mathematics from Baylor University and an M.S. in Computer Science from the College of William and Mary.