SESSION

Improving Apache Spark with S3

Slides PDF Video

Netflix’s Big Data Platform team manages data warehouse in Amazon S3 with over 60 petabytes of data and writes hundreds of terabytes of data every day. At this scale, output committers that create extra copies or can’t handle task failures are no longer practical. This talk will explain the problems that are caused by the available committers when writing to S3, and show how Netflix solved the committer problem.

In this session, you’ll learn:
– Some background about Spark at Netflix
– About output committers, and how both Spark and Hadoop handle failures
– How HDFS and S3 differ, and why HDFS committers don’t work well
– A new output committer that uses the S3 multi-part upload API
– How you can use this new committer in your Spark applications to avoid duplicating data

Session hashtag: #SFdev7

Ryan Blue,  at Netflix

About Ryan

Ryan Blue works on open source projects, including Spark, Avro, and Parquet, at Netflix.