Extending Spark Machine Learning: Adding Your Own Algorithms and Tools


Apache Spark’s machine learning (ML) pipelines provide a lot of power, but sometimes the tools you need for your specific problem aren’t available yet. This talk introduces Spark’s ML pipelines, and then looks at how to extend them with your own custom algorithms. By integrating your own data preparation and machine learning tools into Spark’s ML pipelines, you will be able to take advantage of useful meta-algorithms, like parameter searching and pipeline persistence (with a bit more work, of course).

Even if you don’t have your own machine learning algorithms to implement, this session will give you an inside look at how the ML APIs are built. It will also help you make even more awesome ML pipelines, customize Spark models for your needs, and develop a stronger background for future Spark ML projects.

The examples in this talk will be presented in Scala, but any non-standard syntax will be explained.

Session hashtag: #SFml2

Holden Karau, Software Engineer at IBM

About Holden

Holden Karau is a transgender Canadian, an active open source contributor, and co-author of Learning Spark and High Performance Spark. When not in San Francisco working as a software development engineer at IBM’s Spark Technology Center, Holden talks internationally on Spark and holds office hours at coffee shops at home and abroad. She makes frequent contributions to Spark, specializing in PySpark and Machine Learning. Prior to IBM she worked on a variety of distributed, search, and classification problems at Alpine, Databricks, Google, Foursquare, and Amazon. She graduated from the University of Waterloo with a Bachelor of Mathematics in Computer Science.

Seth Hendrickson, Data Scientist at Cloudera

About Seth

Seth Hendrickson is a top Apache Spark contributor. He implemented multinomial logistic regression with elastic-net regularization in Spark’s ML library and has contributed several other performance improvements to linear models in Spark. He has also made extensive contributions to Spark ML decision trees and ensemble algorithms. Prior to joining IBM, Seth was an electrical engineer working on signal processing and IoT. He earned his M.S. in electrical engineering from Georgia Institute of Technology.