Bloomberg has a strong reputation in the financial industry for providing lightning-fast analytics on vast quantities of data. In this presentation, we talk about Bloomberg’s analytics stack and how Spark, with its formidable computational model for distributed, high-performance analytics, helps take this to the next level. We discuss the kinds of analytics being expressed in Spark and the challenges they pose for Spark’s current functionality and performance. At Bloomberg, instead of building isolated Spark applications for individual problem domains, we are looking at implementing a framework-based approach to registering, discovering, and querying RDDs/DataFrames (DFs) and real-time data streams. RDDs/DFs in the framework are cataloged in a registry, which captures data provenance (backing stores and real-time streams) as well as analytical and domain-specific metadata. This allows for composable analytics over continuously updated data, with significantly less boilerplate code for data plumbing. The results of these analytics can be registered back in the catalog, to be leveraged in higher-order analytics. With such a data catalog, connectors to various internal data systems, and standardized serverization runtimes for hosted Spark applications, Spark can allow for seamless integration between disparate datastores and data domains. We round out this talk by discussing a few challenges with building analytics infrastructure over Spark: the need for dynamic topic registration, efficient stream reconciliation with updateStateByKey, and context sharing for low-latency analytics while achieving efficient resource utilization.
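To make the registry idea concrete, here is a minimal sketch of a catalog that records provenance and metadata, supports discovery, and lets derived analytics be registered back. It is an illustration only, not Bloomberg's framework: the names `Catalog`, `CatalogEntry`, `register`, `discover`, and `derive`, the example URIs, and the sample data are all hypothetical, and plain Python dictionaries stand in for RDDs/DataFrames so that the sketch runs without a Spark cluster.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class CatalogEntry:
    """One registered dataset: how to load it, plus descriptive metadata."""
    name: str
    loader: Callable[[], Any]      # stands in for building an RDD/DataFrame
    provenance: List[str]          # backing stores, streams, or parent entries
    metadata: Dict[str, str] = field(default_factory=dict)


class Catalog:
    """Hypothetical in-memory registry (an assumption, not a real Spark API)."""

    def __init__(self) -> None:
        self._entries: Dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        if entry.name in self._entries:
            raise ValueError(f"{entry.name!r} is already registered")
        self._entries[entry.name] = entry

    def discover(self, **criteria: str) -> List[str]:
        """Return sorted names of entries whose metadata matches every criterion."""
        return sorted(
            name for name, entry in self._entries.items()
            if all(entry.metadata.get(k) == v for k, v in criteria.items())
        )

    def load(self, name: str) -> Any:
        return self._entries[name].loader()

    def derive(self, name: str, parents: List[str],
               fn: Callable[..., Any], **metadata: str) -> None:
        """Register an analytic computed lazily over existing entries."""
        self.register(CatalogEntry(
            name=name,
            loader=lambda: fn(*(self.load(p) for p in parents)),
            provenance=list(parents),   # parents double as lineage
            metadata=metadata,
        ))


catalog = Catalog()
catalog.register(CatalogEntry(
    name="prices",
    loader=lambda: {"AAA": 10.0, "BBB": 20.0},        # made-up sample data
    provenance=["store://example-price-store"],       # hypothetical URI
    metadata={"domain": "equity", "kind": "price"},
))
catalog.register(CatalogEntry(
    name="positions",
    loader=lambda: {"AAA": 3, "BBB": 2},              # made-up sample data
    provenance=["stream://example-position-stream"],  # hypothetical URI
    metadata={"domain": "equity", "kind": "position"},
))

# A derived analytic is registered back so later analytics can build on it.
catalog.derive(
    "market_value", ["prices", "positions"],
    lambda px, pos: {k: px[k] * pos[k] for k in pos},
    domain="equity", kind="analytic",
)
```

In this sketch, `catalog.discover(domain="equity")` finds all three entries by metadata, and `catalog.load("market_value")` recomputes the derived result from its parents on each call, which loosely mirrors how a lazily evaluated dataset would pick up continuously updated inputs.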
Partha Nageswaran is an R&D Lead at Bloomberg, where he’s architecting and helping build out a Low Latency Analytics Ecosystem for Financial Analytics. Previously, as an Advisory Engineer at IBM, he worked on building distributed platforms. He holds three patents and three master’s degrees, in math, physics, and computer science.
Sudarshan Kadambi is an Architect at Bloomberg, helping build Bloomberg’s Data and Compute infrastructure. He has a background in distributed systems, is a long-time user of Hadoop and, more recently, Spark, and is passionate about making these technologies awesome.