San Francisco
June 30 - July 2, 2014

Spark Summit 2014 brought the Apache Spark community together on June 30- July 2, 2014 at the The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.


Spark Summit 2014
Geotrellis: Adding Geospatial Capabilities to Spark
Ameet Kini (DigitalGlobe), Rob Emanuele (Azavea)

Geotrellis is an Apache version 2-licensed pure-Scala open-source project for enabling geospatial processing at both web-scale and cluster-scale. The web-scale use-case [1, 2] is fairly mature, and allows analysts to run sub-second response time operations ranging from simple raster math (*, +, /, etc.) to fairly sophisticated raster and vector operations like Kernel Density and Cost Distance. The cluster-scale use-case is a relatively new effort to port Geotrellis’ rich library of algorithms to Spark, thus allowing for batch as well as interactive processing on large geospatial datasets such as satellite imagery. The effort is a collaboration between Azavea, the inventors of Geotrellis, and DigitalGlobe, a satellite imagery provider who had built a Hadoop map/reduce-based geospatial library [3], and recognized early (back in Spark 0.6.0) the potential of Spark to be the platform for running geospatial analytics at scale.

To the best of our knowledge, Geotrellis is one of the first if not only geospatial libraries on Spark. Being in its nascent stages, we would like to present the work we have done so far and get feedback and hopefully contributions from the vibrant Spark community. The talk will focus on three aspects of Geotrellis on Spark – i) how data is stored in HDFS, ii) how geospatial algorithms are modeled as RDD operations and also how we used some advanced Spark features like custom partitioners, and finally iii) performance numbers for both web-scale and cluster-scale use-cases.

Dr. Ameet Kini is a Principal Engineer at DigitalGlobe where he works on developing novel distributed algorithms for solving geospatial problems. His primary area of expertise is designing scalable data architectures, and has spent the last decade working within the R&D/product development divisions of companies such as as Oracle, IBM, Google, and MITRE. He has a B.S. from UMBC and a M.S./Ph.D. from the U of Wisconsin-Madison, all in Computer Science.

Slides PDF |Video