HDFS on Kubernetes—Lessons Learned


There is growing interest in running Apache Spark natively on Kubernetes. Spark applications often access data in HDFS, and Spark supports HDFS locality by scheduling tasks on nodes that have the task input data on their local disks. When Spark runs on Kubernetes but the HDFS daemons run outside Kubernetes, applications slow down because they must access the data remotely.

This session will demonstrate how to run HDFS inside Kubernetes to speed up Spark. In particular, it will show how the Spark scheduler can still provide HDFS data locality on Kubernetes by discovering the mapping from Kubernetes containers to physical nodes, and from physical nodes to HDFS datanode daemons. You’ll also learn how to provide Spark with a highly available HDFS namenode, a critical service, when running HDFS in Kubernetes.
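The locality idea above can be illustrated with a small sketch. It is not the implementation presented in the session; the function names, pod names, and node names are hypothetical, and the two input mappings are assumed to have been obtained elsewhere (for example, the pod-to-node mapping from the Kubernetes API and the block replica hosts from the HDFS namenode). Given those, an executor is node-local for a block when its pod runs on a physical node that also hosts a datanode holding a replica.

```python
from typing import Dict, List, Optional


def node_local_executors(block_hosts: List[str],
                         pod_to_node: Dict[str, str]) -> List[str]:
    """Return executor pods whose physical node holds a replica of the block.

    block_hosts: physical nodes whose datanodes store the block (assumed input).
    pod_to_node: executor pod name -> physical node it runs on (assumed input).
    """
    hosts = set(block_hosts)
    return sorted(pod for pod, node in pod_to_node.items() if node in hosts)


def pick_executor(block_hosts: List[str],
                  pod_to_node: Dict[str, str]) -> Optional[str]:
    """Prefer a node-local executor; otherwise fall back to any executor."""
    local = node_local_executors(block_hosts, pod_to_node)
    if local:
        return local[0]
    # No executor shares a node with a replica, so the read will be remote.
    return min(pod_to_node) if pod_to_node else None


if __name__ == "__main__":
    pods = {"exec-1": "node-a", "exec-2": "node-b"}
    print(pick_executor(["node-b", "node-c"], pods))
```

The key point of the sketch is that pod IP addresses or pod names alone do not match datanode hosts; the comparison has to happen at the level of the physical node, which is why the pod-to-node mapping is needed.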

Session hashtag: #SFeco12

Kimoon Kim, Pepperdata

About Kimoon

Kimoon joined Pepperdata in 2013. Previously, he worked for the Google Search and Yahoo Search teams for many years. Kimoon has hands-on experience with large distributed systems processing massive data sets.