Streaming algorithms are becoming extremely important as more and more processing moves to real time. Some of these algorithms are reasonably well known, such as k-min counters and HyperLogLog. Other, newer algorithms are also important, however, such as t-digest and streaming k-means. I will survey these and other algorithms in an approachable but sound presentation of the most important algorithms of this kind. I will pay particular attention to the newer algorithms, including t-digest, which allows extremely accurate quantile computation; streaming k-means, which allows accurate clustering with exactly one pass over the data and (nearly) bounded storage; and truly real-time collaborative filtering.
Ted Dunning is a PMC member of the Apache Mahout, Apache ZooKeeper, and Apache Drill projects and a mentor for the Apache Storm, DataFu, Kylin, Flink, and Calcite projects. Ted was the chief architect behind the MusicMatch (now Yahoo Music) and Veoh recommendation systems. He built fraud detection systems for ID Analytics (LifeLock), and he has 24 patents issued to date and a dozen pending. Ted has a PhD in computing science from the University of Sheffield. When he’s not doing data science, he plays guitar and mandolin. He also bought the beer at the first Hadoop user group meeting.