An Online Spark Pipeline: Semi-Supervised Learning and Automatic Retraining with Spark Streaming


Real-time/online machine learning is an integral part of the machine learning landscape, particularly with regard to unsupervised learning. Areas such as focused advertising, stock price prediction, recommendation engines, network evolution, and IoT streams in smart cities and smart homes are growing in both demand and scale. Continuously updating models with efficient update methodologies, accurate labeling, feature extraction, and modularity for mixed models are essential to maintaining scalability, precision, and accuracy in high-demand scenarios.

This session explores a real-time/online learning algorithm and its implementation using Spark Streaming in a hybrid batch/semi-supervised setting. It presents an easy-to-use, highly scalable architecture with advanced customization and performance optimization. Within this framework, we will examine some of the key methodologies for implementing the algorithm, including partitioning and aggregation schemes, feature extraction, model evaluation and correction over time, and our approaches to minimizing loss and improving convergence. The result is a simple, accurate pipeline that can be easily adapted and scaled to a variety of use cases.
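To make the idea of a continuously updated model in a semi-supervised setting more concrete, here is a minimal sketch in plain Python. It is not the algorithm presented in the session and it does not use Spark: each call to `update` stands in for one streaming micro-batch, and unlabeled points are handled with simple confidence-thresholded self-training. The class `OnlineLogisticModel`, the helpers `make_batch` and `run_demo`, the synthetic two-cluster data, and all parameter values are assumptions chosen for illustration only.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


class OnlineLogisticModel:
    """Hypothetical online logistic regression updated one micro-batch at a time."""

    def __init__(self, n_features, learning_rate=0.1, confidence=0.9):
        self.weights = [0.0] * n_features
        self.bias = 0.0
        self.learning_rate = learning_rate
        self.confidence = confidence  # pseudo-label threshold (assumed value)

    def predict_proba(self, x):
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def _sgd_step(self, x, y):
        # Gradient of the log loss for a single example.
        error = self.predict_proba(x) - y
        self.weights = [
            w - self.learning_rate * error * xi
            for w, xi in zip(self.weights, x)
        ]
        self.bias -= self.learning_rate * error

    def update(self, batch):
        """Consume one batch of (features, label); label may be 0, 1, or None.

        Unlabeled points are pseudo-labeled only when the current model is
        confident; otherwise they are skipped. Returns the number of
        examples actually used for an update.
        """
        used = 0
        for x, y in batch:
            if y is None:
                p = self.predict_proba(x)
                if p >= self.confidence:
                    y = 1
                elif p <= 1.0 - self.confidence:
                    y = 0
                else:
                    continue
            self._sgd_step(x, y)
            used += 1
        return used


def make_batch(rng, size, labeled_fraction):
    # Synthetic data: two Gaussian clusters, only some points carry labels.
    batch = []
    for _ in range(size):
        label = rng.randint(0, 1)
        center = 2.0 if label == 1 else -2.0
        x = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        batch.append((x, label if rng.random() < labeled_fraction else None))
    return batch


def run_demo(seed=0, n_batches=50):
    rng = random.Random(seed)
    model = OnlineLogisticModel(n_features=2)
    for _ in range(n_batches):
        model.update(make_batch(rng, size=20, labeled_fraction=0.3))
    # Evaluate on a fully labeled held-out batch.
    test = make_batch(rng, 200, 1.0)
    correct = sum(
        1 for x, y in test if (model.predict_proba(x) >= 0.5) == (y == 1)
    )
    return correct / len(test)


if __name__ == "__main__":
    print("held-out accuracy:", run_demo())
```

In a Spark Streaming deployment, the per-batch update would typically be computed across partitions and aggregated before the model is refreshed; the sequential loop above only illustrates the update logic on a single machine.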

The performance of the algorithm will be compared against existing implementations in both linear and logistic prediction. The session will also cover real-time use cases of the streaming pipeline using real time-series data and present strategies for optimization and implementation to improve both accuracy and efficiency in a semi-supervised setting.

Session hashtag: #SFds19

J White Bear, IBM

About J

University of Michigan—Computer Science
Databases, Machine Learning/Computational Biology, Cryptography

University of California San Francisco—Computational Biology/Bioinformatics
Machine Learning/Multi-objective Optimization/Statistical Mechanics for Protein-Protein Interactions

McGill University
Machine Learning/Multi-objective Optimization for Path Planning/Cryptography