Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
At Ooyala we process over two billion video events a day and provide rich, near real-time, always-available analytics to thousands of customers. Spark Streaming is core to our state-of-the-art ingestion pipeline. In developing this system we encountered and resolved a large number of undocumented challenges, which we would like to share: What are some of the challenges and lessons from productionizing a Spark Streaming pipeline on YARN? How do you ensure 24/7 availability and fault tolerance? What are the best practices for Spark Streaming and its integration with Kafka and YARN? How do you monitor and instrument the various stages of the pipeline? We will dive into all these topics and more.
Issac Buenrostro is a software engineer at Ooyala building a new ingestion system for video analytics events using Spark, YARN, Thrift, and Parquet. Before Ooyala, he earned a bachelor's degree from MIT and a master's degree in applied mathematics from Stanford, working on high-performance scientific computing.
Arup Malakar works on the next-generation analytics ETL pipeline at Ooyala, built on Spark Streaming, YARN, and Kafka. Before Ooyala, he contributed to Apache Hive and HCatalog and helped build the hosted feed-processing platform at Yahoo! Arup holds a bachelor's degree in Computer Science from IIT Guwahati.