The Cornell Lab of Ornithology collects millions of bird observations through the citizen-science project eBird. These data provide a unique source of year-round information to understand the distribution, abundance and movements of birds across large geographic areas and over long periods of time, information essential for broad-scale conservation planning. However, extracting reliable information from these data requires statistical models that fill spatiotemporal gaps and control for heterogeneity in the data. The size of the data, combined with the complexity of the modeling process, presents scaling and computational resource challenges.
This session will describe how Apache Spark has been used to overcome these challenges and dramatically improve the computational efficiency of a large-scale species distribution modeling workflow, enabling researchers to generate more results, more quickly, to advance broad-scale conservation planning and ecological research. Auer will discuss a real-world application that brings together several common big data challenges, highlighting both the technical challenges of using mixed models to deal with bias and gaps in the data and the computational challenges of scaling expensive models. On the technical side, the talk will focus on the computational gains made by transitioning from R-based parallelization to Spark and the challenges faced in the process, including development, tuning and scaling, as well as areas of future work where the Lab will look to further capitalize on Spark to improve its workflow.
Session hashtag: #SFexp8
Tom Auer is a GIS Developer with the Information Science group at Cornell’s Lab of Ornithology. He blends his backgrounds in biology and geography with parallel and cloud computing to help develop and improve the Lab’s species distribution modeling efforts, from data preparation to workflow automation and visualization, focusing on implementation through R and Python. He uses Spark to improve modeling workflow efficiency.