San Francisco
June 30 - July 2, 2014


Spark Summit 2014
Graph-Based Genomic Integration using Spark
David Tester (Novartis Institutes for Biomedical Research)

Many public and quasi-public institutions are releasing ever larger volumes of genomic data. For research institutes performing drug discovery, these datasets can be invaluable. For example, by sifting through tens to hundreds of thousands of publicly available samples, researchers can uncover general truths that no single research institute could arrive at independently. This new knowledge can help us reprioritize or eliminate unnecessary internal experiments, stratify clinical trial cohorts, and identify compound targets, among many other drug discovery goals. However, these large public datasets also present significant challenges: the datasets are both very large and extremely heterogeneous, and the kinds of analytics that life sciences researchers want to run on them are just as heterogeneous. Novartis is attacking these problems with a data integration and analytics layer built primarily on Spark. The scalability advantages of Spark may be obvious; less obvious are some of the ways Spark can be used to handle this extreme heterogeneity of both data and analytics. This presentation gives a brief description of our solution and looks at some of the design decisions we made along the way. Perhaps the most distinctive of these is our decision to represent everything as a giant graph (currently with over a trillion edges) that is stored in HDFS and permits some useful classes of logical reasoning using Spark. The toolchains we have built around this reasoning graph give us the flexibility we need to represent diverse genomic data while also meeting the diverse analytics needs of our life sciences researchers.
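
To make the idea of a "reasoning graph" concrete, here is a minimal sketch in plain Python rather than Spark. It is an illustration under stated assumptions, not the Novartis implementation: the abstract does not say how edges are encoded or which inference rules are used, so the (subject, predicate, object) edge format, the `is_a` and `derived_from` predicates, the sample data, and the `infer_transitive` helper are all hypothetical. The sketch shows one simple class of logical reasoning, treating a predicate as transitive and repeatedly joining the edge set with itself until no new edges appear.

```python
from collections import defaultdict

# Hypothetical edges as (subject, predicate, object) triples.
# All identifiers and predicates here are illustrative assumptions.
EDGES = {
    ("sample_1", "derived_from", "lung_tissue"),
    ("lung_tissue", "is_a", "respiratory_tissue"),
    ("respiratory_tissue", "is_a", "tissue"),
}


def infer_transitive(edges, predicate):
    """Return `edges` plus every edge implied by treating `predicate` as transitive."""
    closed = set(edges)
    while True:
        # Index edges with the chosen predicate by subject (one side of the join).
        by_subject = defaultdict(set)
        for s, p, o in closed:
            if p == predicate:
                by_subject[s].add(o)
        # One join round: (a, p, b) and (b, p, c) together imply (a, p, c).
        new_edges = {
            (a, predicate, c)
            for a, p, b in closed
            if p == predicate
            for c in by_subject.get(b, ())
        } - closed
        if not new_edges:
            return closed  # fixed point reached, nothing left to infer
        closed |= new_edges


inferred = infer_transitive(EDGES, "is_a")
```

In this toy graph the only new fact is that `lung_tissue` is a `tissue`; the `derived_from` edge is left alone because it was not declared transitive. At scale, each round of the loop would presumably be expressed as a distributed join over the edge collection, but that mapping onto Spark is an assumption of this sketch.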

David Tester is currently an Application Architect at the Novartis Institutes for Biomedical Research (NIBR) and the Technical Lead for the NIBR-wide effort to integrate, data-mine, and analyze as many genomic datasets as Novartis can get its hands on. David has a background in Artificial Intelligence and Formal Semantics and a Ph.D. from Oxford University.

Slides PDF | Video