Geospatial data is pervasive, and spatial context is a rich signal of user intent and relevance in search and targeted advertising, as well as an important variable in many predictive analytics applications. For example, when a user searches for “canyon hotels”, without location awareness the top result or sponsored ads might be for hotels in the town of Canyon, TX. However, if the user is near the Grand Canyon, the top results or ads should be for nearby hotels. Thus a search term combined with location context allows for much more relevant results and ads. Similarly, a variety of other predictive analytics problems can leverage location as context. Leveraging spatial context in a predictive analytics application requires us to parse geospatial datasets at scale, join them with target datasets that contain point-in-space information, and answer geometric queries efficiently. In this talk, we describe the motivation for and the internals of an open source library that we are building for geospatial analytics, using Spark SQL, DataFrames and Catalyst as the underlying engine. We outline how we leverage Catalyst’s pluggable optimizer to execute spatial joins efficiently and how Spark SQL’s powerful operators allow us to express geometric queries in a natural DSL, and we discuss some of the geometric algorithms that we implemented in the library. We also describe the Python bindings that we expose, leveraging PySpark’s Python integration.
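To make the idea of joining point data against shapes concrete, here is a minimal sketch in plain Python. It is not the library's API: the names point_in_polygon, bounding_box and spatial_join are hypothetical, and the ray-casting test with a bounding-box prefilter is a standard textbook approach chosen for illustration, not necessarily the algorithm the library implements.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]  # vertices in order; the last vertex connects back to the first


def point_in_polygon(point: Point, polygon: Polygon) -> bool:
    """Ray casting: count how many edges a horizontal ray from the point crosses."""
    x, y = point
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the ray's y coordinate can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def bounding_box(polygon: Polygon) -> Tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) for a polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)


def spatial_join(points: Dict[str, Point],
                 polygons: Dict[str, Polygon]) -> List[Tuple[str, str]]:
    """Naive nested-loop join: pair each point id with every polygon id containing it."""
    boxes = {pid: bounding_box(poly) for pid, poly in polygons.items()}
    matches = []
    for point_id, (x, y) in points.items():
        for poly_id, poly in polygons.items():
            min_x, min_y, max_x, max_y = boxes[poly_id]
            # Cheap bounding-box rejection before the exact geometric test.
            if not (min_x <= x <= max_x and min_y <= y <= max_y):
                continue
            if point_in_polygon((x, y), poly):
                matches.append((point_id, poly_id))
    return matches


square = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
result = spatial_join({"a": (2.0, 2.0), "b": (5.0, 2.0)}, {"sq": square})
```

The nested loop compares every point with every polygon, which is exactly the cost that an optimized spatial join aims to avoid at scale; the bounding-box check only hints at the kind of pruning involved.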
Ram Sriharsha is a Senior Member of Technical Staff at Hortonworks, focused on Spark, machine learning and data science. Ram is an Apache Spark committer and PMC member. Prior to joining Hortonworks, he was a Principal Research Scientist at Yahoo Research, where he worked on large-scale machine learning algorithms and systems related to login risk detection, sponsored search advertising and advertising effectiveness measurement.