SESSION

Productionizing Spark and the Spark REST Job Server


This is a two-part talk. The first part covers general tips for deploying, configuring, and running Apache Spark applications, drawn from my personal experience setting up and running Spark clusters since the early days of version 0.9. Should one deploy using Spark Standalone mode, Mesos, or YARN? What about DataStax DSE and other options such as EMR? What are the important considerations when configuring Spark and when working with jars and dependencies? We will cover all this and more, including tips for running and debugging Spark applications.

The second part covers the Spark Job Server, a leading option for running and managing Spark jobs as a REST service. With it, you get automatic logging of job status and configuration to a database. We go in depth into using the Job Server, in particular as a way to share Spark RDDs among logical jobs for low-latency queries. Another interesting use case is SQL/DataFrame queries on Spark Streaming data. Learn about productionizing Spark and running it as a REST service with the Spark Job Server!
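As a taste of the RDD-sharing pattern the talk highlights, here is a minimal sketch of a Job Server job written against the classic SparkJob API with the NamedRddSupport mixin; exact traits and signatures vary across Job Server versions, and the names WordCountJob, "shared-words", and input.string are illustrative, not from the talk.

    import com.typesafe.config.Config
    import org.apache.spark.SparkContext
    import spark.jobserver.{NamedRddSupport, SparkJob, SparkJobInvalid, SparkJobValid, SparkJobValidation}

    // Sketch: a job that registers its result RDD under a shared name so that
    // later jobs running in the same SparkContext can query it with low latency.
    object WordCountJob extends SparkJob with NamedRddSupport {

      // validate() runs before the job is queued, so bad requests fail fast
      // with an HTTP error instead of a failed Spark job.
      override def validate(sc: SparkContext, config: Config): SparkJobValidation =
        if (config.hasPath("input.string")) SparkJobValid
        else SparkJobInvalid("missing config value: input.string")

      override def runJob(sc: SparkContext, config: Config): Any = {
        val words = sc.parallelize(config.getString("input.string").split(" ").toSeq)
        // Cache and register the RDD under a name visible to other jobs
        // in this context.
        namedRdds.update("shared-words", words.cache())
        words.countByValue()
      }
    }

Packaged into a jar, the job would be uploaded and triggered over REST roughly like this (endpoints as documented in the Job Server README; the app and context names are made up):

    # create a long-lived context so the named RDD outlives a single job
    curl -d "" 'localhost:8090/contexts/shared-ctx'
    # upload the application jar
    curl --data-binary @wordcount.jar localhost:8090/jars/wordcount
    # run the job synchronously in the shared context
    curl -d 'input.string = a b c a b' \
      'localhost:8090/jobs?appName=wordcount&classPath=WordCountJob&context=shared-ctx&sync=true'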


About Evan

Evan loves to design, build, and improve bleeding-edge distributed data and backend systems using the latest in open source technologies. He has led the design and implementation of multiple big data platforms based on Storm, Spark, Kafka, Cassandra, and Scala/Akka, including a columnar real-time distributed query engine. He is an active contributor to the Apache Spark project, a DataStax Cassandra MVP, and co-creator and maintainer of the open-source Spark Job Server. He is a big believer in GitHub, open source, and meetups, and has given talks at various conferences including Spark Summit, Cassandra Summit, FOSS4G, and Scala Days.