Generating features is one of the most important yet least-discussed aspects of building machine learning systems. Because it’s easy to get started, data teams often wind up with a pile of features riddled with problems: some are inaccurate, some duplicate work, some are unreliable, some are unnecessary, and the whole pipeline is too slow and opaque. In this talk, I’ll show how features don’t have to be such a pain, using techniques from the reactive approach to machine learning. When building reactive machine learning systems, we hold our complicated, large-scale machine learning systems to the same standard as modern web and mobile apps. Using the Reactive Manifesto as our guide, we can identify the traits we want our machine learning systems to have: responsiveness, elasticity, resilience, and so on. This talk focuses on how to achieve those traits in feature generation pipelines specifically. We’ll start by building on MLlib’s feature generation capabilities and the broader capabilities of Spark, but we’ll go well beyond the example code in the programming guide. Key points along the way will include:
• Structuring feature transforms
• Supervising feature pipelines
• Operating on collections of features
• Techniques for validating features
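To give a flavor of the first and last points, here is a minimal sketch in plain Scala of structuring feature transforms and validating their outputs. All names (`Feature`, `Transform`, `TokenCount`, `generate`) are hypothetical illustrations, not APIs from MLlib or the talk’s codebase.

```scala
// A named feature value produced by some transform
// (hypothetical types for illustration, not MLlib classes)
case class Feature[V](name: String, value: V)

// A feature transform: a named function from raw input to a feature.
// Returning Option makes failures explicit instead of thrown.
trait Transform[I, V] {
  def name: String
  def apply(input: I): Option[Feature[V]]
}

// Example transform: token count over raw text, guarded against empty input
object TokenCount extends Transform[String, Int] {
  val name = "tokenCount"
  def apply(text: String): Option[Feature[Int]] =
    if (text.trim.isEmpty) None
    else Some(Feature(name, text.trim.split("\\s+").length))
}

// Operating on a collection of transforms: apply them all and keep only
// the features that validated successfully
def generate[I](input: I, transforms: Seq[Transform[I, _]]): Seq[Feature[_]] =
  transforms.flatMap(_.apply(input))

val features = generate("reactive machine learning", Seq(TokenCount))
println(features)
```

Because every transform returns an `Option`, invalid inputs simply yield no feature rather than crashing the pipeline, which is one way to get the resilience the Reactive Manifesto calls for.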
You don’t have to be frustrated by your feature generation code! Using the power of Spark and the principles of reactive machine learning, you too can build awesome feature generation capabilities that help you achieve your data science goals.
Jeff Smith builds large-scale machine learning systems using Scala and Spark. For the past decade, he has been working on data science applications at various startups in New York, San Francisco, and Hong Kong. He’s a frequent blogger and the author of “Reactive Machine Learning Systems,” an upcoming book from Manning on how to build real-world machine learning systems using Scala, Akka, and Spark.