Recipes for Running Spark Streaming Applications in Production

Spark Streaming extends core Apache Spark to perform large-scale stream processing. It is being rapidly adopted by companies across many business verticals, for uses such as ad monitoring, real-time analysis of machine data, and anomaly detection. This interest is due to its simple, high-level programming model and its seamless integration with SQL querying (Spark SQL), machine learning algorithms (MLlib), etc. However, to build a real-time streaming analytics pipeline, it's not sufficient to be able to easily express your business logic. Running the platform with high uptime and monitoring it continuously pose many operational challenges. Fortunately, Spark Streaming makes all that easy as well. In this talk, I am going to elaborate on various operational aspects of a Spark Streaming application at different stages of deployment: prototyping, testing, monitoring continuous operation, and upgrading. In short, all the recipes that take you from "hello-world" to large-scale production in no time.

About Tathagata

Tathagata Das is an Apache Spark Committer and a member of the PMC. He is the lead developer behind Spark Streaming and is currently employed at Databricks. Before joining Databricks, he was at the AMPLab at UC Berkeley, researching datacenter frameworks and networks with professors Scott Shenker and Ion Stoica.