Big data never stops and neither should your Spark jobs. They should not stop when they see invalid input data. They should not stop when there are bugs in your code. They should not stop because of I/O-related problems. They should not stop because the data is too big. Bulletproof jobs not only keep working but also make it easy to identify and address the common problems encountered in large-scale production Spark processing: data quality, code quality, operational issues, and rising data volumes over time.
In this session you will learn three key principles for bulletproofing your Spark jobs, together with the architecture and system patterns that enable them. The first principle is idempotence. Exemplified by Spark 2.0 Idempotent Append operations, it enables 10x easier failure management. The second principle is row-level structured logging. Exemplified by Spark Records, it enables 100x (yes, one hundred times) faster root cause analysis. The third principle is invariant query structure. It is exemplified by Resilient Partitioned Tables, which allow for flexible management of large-scale data over long periods of time, including handling late-arriving data, reprocessing existing data to deal with bugs or data quality issues, and repartitioning data that has already been written.
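To make the first two principles concrete, here is a minimal sketch in plain Python. It does not use Spark and is not the API of Spark Records or of the Idempotent Append operations mentioned above; the names `Record`, `BatchStore`, `idempotent_append`, `process_batch`, and `parse_amount` are hypothetical and exist only for illustration. It assumes that each batch carries a stable identifier, so that a retried write replaces the earlier attempt instead of duplicating it, and that every input row produces exactly one output record carrying either data or an error.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class Record:
    """Hypothetical row-level envelope: one record per input row, good or bad."""
    source: Any                   # the raw input row, kept for root cause analysis
    data: Optional[Any] = None    # the transformed output, if the row succeeded
    error: Optional[str] = None   # a structured error description, if it failed


def process_batch(rows: List[Any], transform: Callable[[Any], Any]) -> List[Record]:
    """Transform every row without letting one bad row stop the job."""
    records = []
    for row in rows:
        try:
            records.append(Record(source=row, data=transform(row)))
        except Exception as exc:  # bad input or a bug: log it at the row level
            records.append(Record(source=row, error=f"{type(exc).__name__}: {exc}"))
    return records


class BatchStore:
    """Hypothetical in-memory sink keyed by batch id (stands in for a table)."""

    def __init__(self) -> None:
        self._batches: Dict[str, List[Record]] = {}

    def idempotent_append(self, batch_id: str, records: List[Record]) -> None:
        # Writing the same batch id again overwrites the previous attempt,
        # so a retry after a failure cannot create duplicate rows.
        self._batches[batch_id] = list(records)

    def all_records(self) -> List[Record]:
        return [r for bid in sorted(self._batches) for r in self._batches[bid]]


def parse_amount(row: Dict[str, str]) -> Dict[str, Any]:
    """Example transform; raises ValueError on a non-numeric amount."""
    return {"user": row["user"], "amount": float(row["amount"])}


store = BatchStore()
batch = [{"user": "a", "amount": "1.5"}, {"user": "b", "amount": "oops"}]

store.idempotent_append("batch-001", process_batch(batch, parse_amount))
store.idempotent_append("batch-001", process_batch(batch, parse_amount))  # retry

failed = [r for r in store.all_records() if r.error is not None]
```

After the retry the store still holds two records, not four, and `failed` contains the one row whose amount could not be parsed, together with its original input. Keeping the failing input next to the error is what makes row-level diagnosis fast in this sketch; a production system would persist the same information in a table rather than in memory.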
These patterns have been successfully used in production at Swoop in the demanding world of petabyte-scale online advertising.
Sim Simeonov is the founding CTO of Swoop, a startup that brings the power of search advertising to content. Previously, Sim was the founding CTO at Ghostery, the platform for safe & fast digital experiences, and at Thing Labs, a social media startup acquired by AOL. Earlier, Sim was vice president of emerging technologies and chief architect at Macromedia (now Adobe) and chief architect at Allaire, one of the first Internet platform companies. He blogs at blog.simeonov.com, tweets as @simeons, and lives in the Greater Boston area with his wife, son, and an adopted dog named Tye.