Scaling Genetic Data Analysis with Apache Spark


In 2001, it cost ~$100M to sequence a single human genome. In 2014, due to dramatic improvements in sequencing technology far outpacing Moore’s law, we entered the era of the $1,000 genome. At the same time, the power of genetics to impact medicine has become evident: for example, drugs with supporting genetic evidence have twice the clinical trial success rate. These factors have led to an explosion in the volume of genetic data, in the face of which existing analysis tools are breaking down.

Therefore, we began the open-source Hail project, a scalable platform built on Apache Spark that enables the worldwide genetics community to build, share, and apply new tools. Hail is focused on variant-level (post-read) data; querying genetic data, annotations, and sample data; and performing rare and common variant association analyses. Hail has already been used to analyze datasets with hundreds of thousands of exomes and tens of thousands of whole genomes.

We will give an overview of the goals of the Hail project and its architecture. The challenge of efficiently manipulating genetic data in Spark has led to several innovations that may have wider applicability, including an RDD-like abstraction for representing multidimensional data and an OrderedRDD abstraction for ordered data (for example, data indexed by position in the genome). Finally, we will discuss Hail performance and future directions.

Cotton Seed, Sr. Principal Software Engineer at Broad Institute of MIT and Harvard

About Cotton

Cotton Seed is a Sr. Principal Software Engineer and leader of the Hail team at the Broad Institute of MIT and Harvard. Prior to joining the Broad, he did a PhD in Mathematics at Princeton University and spent over a decade building high-performance computing systems with a focus on advanced compiler technology at Connected Components Corp, Intel, and Reservoir Labs, among others.