San Francisco
June 30 - July 2, 2014


Spark Summit 2015e
Streaming machine learning in Spark
Jeremy Freeman (HHMI Janelia Research Center)

Many kinds of data are acquired sequentially over time. Rather than wait for all the data to be collected, streaming analyses let us identify patterns – and make decisions based on them – as soon as data start arriving. When data are non-stationary and patterns change over time, streaming analyses adapt. And at scales where storing raw data becomes impractical, streaming analyses let us persist only smaller, more targeted representations. This talk will describe machine learning approaches to analyzing streams of data using Spark Streaming. We have introduced abstractions that generalize stochastic gradient update rules to the streaming case. We have also developed a scalable streaming clustering algorithm, with an intuitive parameterization that controls the time scale of adaptation. Both analyses are now available as part of MLlib. I will also describe analyses designed specifically for time series data, which maintain models associated with different streams and relate them to one another, alongside visualization layers for interactive inspection. We are using these techniques to analyze large-scale neural recordings, providing live representations of neural activity as data arrive and supporting targeted interrogations of brain function.
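
To make the idea of a clustering update with a tunable time scale of adaptation concrete, here is a minimal sketch in plain Python. It is not the MLlib implementation: the function name `update_centers` and the `decay` parameter are hypothetical, chosen for illustration. The sketch assumes a "forgetful" mini-batch k-means rule in which each center's accumulated weight is discounted by `decay` before the new batch is folded in.

```python
def update_centers(centers, weights, batch, decay):
    """Apply one forgetful mini-batch k-means update (illustrative sketch).

    centers: list of k points, each a list of floats
    weights: list of k floats, the effective number of points seen per center
    batch:   list of new points arriving in this time step
    decay:   hypothetical parameter in [0, 1] discounting past data
    """
    k = len(centers)
    dim = len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k

    # Assign each new point to its nearest current center.
    for point in batch:
        nearest = min(
            range(k),
            key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centers[j])),
        )
        counts[nearest] += 1
        for d in range(dim):
            sums[nearest][d] += point[d]

    new_centers, new_weights = [], []
    for j in range(k):
        old = weights[j] * decay      # discounted weight of past data
        total = old + counts[j]       # effective weight after this batch
        if total == 0:
            # No past weight and no new points: leave the center unchanged.
            new_centers.append(list(centers[j]))
            new_weights.append(0.0)
        else:
            # Weighted average of the old center and the new points.
            new_centers.append(
                [(centers[j][d] * old + sums[j][d]) / total for d in range(dim)]
            )
            new_weights.append(total)
    return new_centers, new_weights


# Example: two one-dimensional centers, one new point near each.
centers, weights = update_centers(
    [[0.0], [10.0]], [1.0, 1.0], [[2.0], [12.0]], decay=1.0
)
```

In this sketch, `decay=1.0` weights all past data equally, so the centers converge slowly toward a long-run average, while `decay=0.0` discards history and places each center at the mean of the points assigned to it in the latest batch. Values in between set how quickly the model tracks a non-stationary stream.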

Neuroscientist at HHMI Janelia Research Center using computation to understand the brain. Passionate about brains, behavior, analytics, big data, visualization, open source, and open science.