Spark Summit 2014 brought the Apache Spark community together from June 30 to July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.
Much of the Big Data that Spark wielders tackle is unstructured text that requires text processing techniques. For example, named entity extraction on tweets and sentiment analysis on customer reviews are common tasks. The Unstructured Information Management Architecture (UIMA) framework is an Apache project that provides APIs and infrastructure for building complex and robust text analytics systems. A typical system built on UIMA defines a collection of analysis engines (e.g. a tokenizer, a part-of-speech tagger, a named entity recognizer) which are executed according to arbitrarily complex flow control definitions. The framework makes it possible to build interoperable components, so that best-of-breed solutions can be mixed, matched, and chained together to create sophisticated text processing pipelines. However, UIMA can seem like a heavyweight solution that has a sprawling API, is cumbersome to configure, and is difficult to execute. Furthermore, UIMA provides its own distributed computing infrastructure and runtime processing engines that overlap, in their own way, with Spark functionality. For Spark to benefit from UIMA, the latter must be lightweight and nimble and must not impose its architecture and tooling on Spark.
In this talk, I will introduce uimaFIT, a project that I started and that is now part of the UIMA project (http://uima.apache.org/uimafit.html). With uimaFIT it is possible to adopt UIMA in a very lightweight way and leverage it for what it does best: text processing. An entire UIMA pipeline can be encapsulated inside a single function call that takes, for example, a string input parameter and returns the named entities found in the input string. This allows one to call a Spark RDD transformation (e.g. map) that performs named entity recognition (or whatever text processing tasks your UIMA components accomplish) on the string values in your RDD. This approach requires little UIMA tooling or configuration and effectively reduces UIMA to a text processing library that can be called, rather than a platform that requires full-scale adoption. I will prepare a companion resource for this talk that will provide a complete, self-contained, working example of how to leverage UIMA using uimaFIT from within Spark.
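As a rough illustration of the pattern described above, the following Python sketch wraps a small chain of processing stages behind a single string-in, entities-out function and maps it over a collection of strings. It does not use UIMA, uimaFIT, or Spark: the stage functions (`tokenize`, `tag_capitalized_runs`), the `PIPELINE` list, and `extract_entities` are hypothetical stand-ins, the "entity recognizer" is a deliberately naive toy, and the built-in `map` stands in for an RDD `map` call.

```python
import re
from typing import Any, Callable, Dict, List

# A shared, mutable document that each stage annotates. This loosely mirrors
# the idea of analysis engines adding annotations to a common structure; it is
# NOT the UIMA data model.
Document = Dict[str, Any]
Stage = Callable[[Document], None]


def tokenize(doc: Document) -> None:
    """Hypothetical stand-in for a tokenizer stage."""
    doc["tokens"] = re.findall(r"[A-Za-z0-9']+", doc["text"])


def tag_capitalized_runs(doc: Document) -> None:
    """Toy stand-in for a named entity recognizer.

    Treats each run of consecutive capitalized tokens as one entity. A real
    recognizer would be far more sophisticated.
    """
    entities: List[str] = []
    current: List[str] = []
    for token in doc["tokens"]:
        if token[0].isupper():
            current.append(token)
        elif current:
            entities.append(" ".join(current))
            current = []
    if current:
        entities.append(" ".join(current))
    doc["entities"] = entities


# The whole pipeline is just an ordered list of stages (an assumption made
# for this sketch, not how uimaFIT describes pipelines).
PIPELINE: List[Stage] = [tokenize, tag_capitalized_runs]


def extract_entities(text: str) -> List[str]:
    """Run every stage on one string and return only the entities.

    Callers see a plain function; the pipeline machinery stays hidden.
    """
    doc: Document = {"text": text}
    for stage in PIPELINE:
        stage(doc)
    return doc["entities"]


# Built-in map stands in for a Spark RDD map over string records.
records = [
    "we flew to San Francisco with Philip Ogren",
    "no entities here",
]
results = list(map(extract_entities, records))
```

Because `extract_entities` is an ordinary function from a string to a list, the same call shape could in principle be passed to a distributed map; how the underlying pipeline is constructed and reused on each worker is left out of this sketch.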
Philip Ogren is a senior software developer at Oracle, where he works on a Spark-based Big Data project. He holds a PhD in Computer Science from the University of Colorado at Boulder, where he worked on ClearTK, a UIMA-based framework for creating statistical natural language processing components.