Spark Summit 2014 brought the Apache Spark community together on June 30 to July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming, and related projects.
What do you do when you need to update your models sooner than your existing batch workflows allow? At Sharethrough we faced exactly this question. Although we use Hadoop extensively for batch processing, we needed a system to process clickstream data as soon as possible for our real-time ad auction platform. We found Spark Streaming to be a perfect fit for us because of its easy integration into the Hadoop ecosystem, powerful functional programming API, and low-friction interoperability with our existing batch workflows.
In this talk, we’ll present some of the use cases that led us to choose a stream processing system and explain why we use Spark Streaming in particular. We’ll also discuss how we organized our jobs to promote reusability between batch and streaming workflows and to improve testability.
Russell Cardullo is a Software Engineer at Sharethrough, where he is helping build a native advertising platform. Previously he worked at M*Modal, where he built large-scale automatic machine learning systems for speech recognition. He’s passionate about data science and DevOps and is focused on delivering data-driven products.