Spark Streaming lets users develop and continuously deliver fresh analytical answers, and it does so with minimal overhead compared to a batch job. But one hard part of streaming with Spark is tuning a cluster, especially in high-throughput situations. This talk draws on the experience of deploying clusters that handle millions of updates per second to show how to do it better. After covering the internals of Spark Streaming, we will explain how to scale ingestion, parallelism, data locality, caching and logging. But will every step of this fine-tuning remain necessary forever? As we dive into recent work on Spark Streaming, we will show how clusters can self-adapt to high-throughput situations. The audience will take away a better grasp of Streaming internals and know how to set up their cluster for long-running jobs. After a quick introduction to Reactive Streams, they will also understand how asynchronous back pressure helps make Streaming more resilient.
Gerard is the lead of the Data Processing Team at Virdata.com, where he and his team build and extend the data processing pipeline for Virdata’s IoT cloud platform. He has a background in Computer Science and is a former Java geek now converted to Scala. Throughout his career at technology companies like Alcatel-Lucent, Bell Labs and Sony, he has mostly worked on the interaction between back-end services and devices, which has now converged in his IoT-focused work at Virdata.
François Garillot worked on Scala’s type system in 2006, earned his PhD from the French École Polytechnique in 2011, and joined Typesafe in 2012 after a brief stint in Internet advertising. He worked on the interface between the Scala compiler and its IDE, while nourishing a strong enthusiasm for data analytics in his spare time, until Apache Spark let him pursue this passion as his main job. In November 2014, he became the very first developer in the world to receive Spark Certification.