This talk will present an architecture employing Apache Accumulo to manage a distributed index in order to process spatially and temporally indexed datasets. Multi-dimensional spaces must be mapped to one-dimensional index of the RDD. The core concern is constructing the index to balance the opposing goals of even cluster utilization and data locality. The content of this talk will fit the following outline: – Overview Accumulo and its use in Spark. – How Accumulo can be used to construct a multi-dimensional index. – Minimizing large shuffles while avoiding hot-spotting in the context of cross RDD computations. – Benchmarks and demos. The two goals, optimal cluster utilization and optimal data locality, are fundamentally at odds when working across multiple RDDs or employing algorithms that access spatially neighboring records. Maintaining perfect data locality when working with irregularly distributed datasets leads to cluster hot-spotting. Sharding can be used to alleviate this issue but leads to excessive shuffles for some use-cases. The context for the talk will be the architecture of GeoTrellis, a library for distributed processing of map tiles. We will examine geo-spatial algorithms and the demands they place on the index. The benchmarks will compare sourcing tiles from Accumulo vs. HDFS MapFiles and performance of indexing schemes under those algorithms. The geo-spatial concepts will be used to provide the context, but the talk will not require deep understand of that domain. The concepts apply broadly to all multi-dimensional analysis.
Eugene Cheipesh is a software developer at Azavea on the GeoTrellis team, working to bring big data geo-spatial processing capability to the open source ecosystem.
Rob Emanuele is a software developer at Azavea and the lead of the GeoTrellis project. He is a member of the PMC at LocationTech, and is the Program Committee chair for FOSS4G NA 2015 (http://2015.foss4g-na.org)