San Francisco
June 30 - July 2, 2014

Spark Summit 2014 brought the Apache Spark community together on June 30 - July 2, 2014 at The Westin St. Francis in San Francisco. It featured production users of Spark, Shark, Spark Streaming and related projects.


Spark Summit 2014
Text Analytics on Spark
Dimple Bhatia, Sudarshan Thitte (IBM)

IBM BigInsights Text Analytics is a fast, declarative, rule-based information extraction (IE) system for extracting valuable insights from unstructured content. The system comprises a fast and efficient runtime that applies numerous optimization techniques to extraction programs written in Annotation Query Language (AQL), an English-like declarative language for rule-based information extraction. A declarative language keeps the rule logic expressive, clear and easy to understand, especially when debugging unexpected extraction results. Spark’s resilient, fast, in-memory distributed system is a perfect fit for exposing an in-memory IE engine such as IBM BigInsights Text Analytics: such an integration will yield low-latency, highly interactive, efficient and comprehensible analytics pipelines for developers, scientists, analysts and line-of-business users attempting non-trivial analyses.

In this talk, we will delve into this integration architecture, explain the fundamentals of the IBM BigInsights Text Analytics system, and demonstrate the integration using IE programs that extract useful insights from regulatory filings and other data sources about an organization’s financials. Company filings with regulatory agencies and news feeds are just two of the numerous sources of unstructured content of great interest to regulators, investors, analysts and bankers trying to extract the structured entity and relationship information buried within them. These non-trivial extraction tasks yield insights that help in modeling and analyzing institutions and industries along numerous dimensions, such as investment decisions, counter-party event monitoring and estimation of financial indices, among others.

Dimple Bhatia is a Senior Software Engineer at IBM’s Silicon Valley Lab. She currently leads development of a web-based application for building, refining and executing extractors for Text Analytics in the Big data organization. Previously, she led tooling for database servers, such as Health Monitoring and Job Management. She has also worked on Federation Server and Data Warehousing technologies. She holds a master’s degree in computer engineering from Syracuse University in New York.

Sudarshan is a Staff Software Engineer in Text Analytics with IBM’s Big data group. Prior to this, he attended Stanford University for his graduate studies in Computer Science, specializing in Computational Linguistics and Data Mining. He enjoys distributed systems, data mining, information retrieval and coding in general. When not being a nerd, he can be found enjoying badminton, swimming and volleyball.

Slides PDF | Video