Spark Summit 2014 brought the Apache Spark community together on June 30 - July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Everyone who has maintained a search cluster knows the pain of keeping online update code and offline reindexing pipelines in sync. Subtle bugs can pop up when our data is indexed differently depending on the context. By using Spark and Spark Streaming we can reuse the same indexing code in both contexts, and even reduce overhead by talking directly to the correct indexing node.
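The idea of sharing one indexing function between the offline and online paths can be sketched without a cluster. What follows is a minimal plain-Python illustration under stated assumptions, not the talk's code and not the Spark API: `to_index_doc`, `reindex_batch`, and `index_stream` are hypothetical names, the tweet fields are made up, and a dict stands in for the search index. In Spark, the same function would roughly be passed to a `map` over a batch dataset and over a stream.

```python
from typing import Dict, Iterable, List


def to_index_doc(tweet: Dict) -> Dict:
    """Single place that decides how a tweet becomes an index document."""
    words = tweet["text"].split()
    return {
        "id": tweet["id"],
        "text": tweet["text"].strip(),
        # Lower-cased, de-duplicated hashtags without the leading '#'.
        "hashtags": sorted(
            {w[1:].lower() for w in words if w.startswith("#") and len(w) > 1}
        ),
        "location": {"lat": tweet["lat"], "lon": tweet["lon"]},
    }


def reindex_batch(tweets: Iterable[Dict]) -> Dict[int, Dict]:
    """Offline path: rebuild the whole (stand-in) index from stored tweets."""
    return {doc["id"]: doc for doc in map(to_index_doc, tweets)}


def index_stream(index: Dict[int, Dict], micro_batches: Iterable[List[Dict]]) -> None:
    """Online path: apply updates one micro-batch at a time."""
    for batch in micro_batches:
        for doc in map(to_index_doc, batch):
            index[doc["id"]] = doc


# Hypothetical sample data.
tweets = [
    {"id": 1, "text": "Hello #Spark ", "lat": 37.79, "lon": -122.41},
    {"id": 2, "text": "#search with #spark", "lat": 40.71, "lon": -74.01},
]

offline_index = reindex_batch(tweets)
online_index: Dict[int, Dict] = {}
index_stream(online_index, [[tweets[0]], [tweets[1]]])
```

Because both paths call the same `to_index_doc`, the two indexes cannot drift apart in how a document is shaped; only the delivery mechanism differs.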
Sometimes we need to use search data as part of our distributed map-reduce jobs. We will illustrate how to use Elasticsearch as a side data source with Spark.
We will also illustrate both of these tasks with two real examples using the Twitter firehose. In the first, we will index tweets in a geospatial context; in the second, we will use the same index to determine the top hashtags per region.
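The second example, top hashtags per region, amounts to grouping tweets by region and counting tags within each group. The sketch below is plain Python under stated assumptions, not the talk's implementation: a "region" is approximated here by a one-degree latitude/longitude grid cell rather than a geospatial query against the search index, and `region_of`, `top_hashtags_per_region`, and the sample tweets are hypothetical.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Cell = Tuple[int, int]


def region_of(lat: float, lon: float, cell_deg: float = 1.0) -> Cell:
    """Assumed region scheme: bucket a coordinate into a fixed-size grid cell."""
    return (math.floor(lat / cell_deg), math.floor(lon / cell_deg))


def top_hashtags_per_region(tweets: Iterable[Dict], n: int = 1) -> Dict[Cell, List[str]]:
    """Count hashtags per grid cell and keep the n most frequent in each."""
    counts: Dict[Cell, Counter] = defaultdict(Counter)
    for tweet in tweets:
        cell = region_of(tweet["lat"], tweet["lon"])
        for word in tweet["text"].split():
            if word.startswith("#") and len(word) > 1:
                counts[cell][word[1:].lower()] += 1
    return {cell: [tag for tag, _ in c.most_common(n)] for cell, c in counts.items()}


# Hypothetical sample data: two tweets near San Francisco, one near New York.
sample = [
    {"text": "#spark #search", "lat": 37.79, "lon": -122.41},
    {"text": "more #Spark", "lat": 37.77, "lon": -122.42},
    {"text": "#elasticsearch", "lat": 40.71, "lon": -74.01},
]

top_tags = top_hashtags_per_region(sample, n=1)
```

In a distributed job the same shape would likely appear as a keyed count followed by a per-key top-n, with the region key coming from the geospatial index instead of the grid used here.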
Holden Karau is a software development engineer at Databricks and is active in open source. She is the author of Fast Data Processing with Spark. Prior to Databricks, she worked on a variety of search and classification problems at Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science. Outside of software, she enjoys playing with fire, welding, and hula hooping.