After you train a machine learning pipeline, it can be challenging to deploy it to production, maintain versioned histories of both the model and data used in training, and make the entire process reproducible. This talk will show you how to use Apache Spark with two open source tools, Pachyderm and MLeap, to achieve all of those goals.
Data provenance, provided by Pachyderm, gives a detailed audit of every data source that feeds your data pipeline at each step, as well as a rich, versioned history of your data. Spark ML provides a platform for training full machine learning pipelines, including feature generation/extraction and predictive models, on these versioned and tracked datasets. Finally, MLeap provides the tools to instantly deploy, version, share, and audit these machine learning pipelines in production. The result is instant model deployment, full reproducibility of your entire pipeline from data import to production ML pipeline, and a full audit log that runs from raw training data sources all the way to the predictions made by the model.
Learn how to set up these tools to produce a data pipeline with complete data provenance and model auditing so that your company can develop ML pipelines quickly, reproducibly, and safely, with a high level of visibility into every step.
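As a rough illustration of the provenance idea described in this abstract, the sketch below ties a trained model to the exact version of the data that produced it. It is plain Python and does not use the Pachyderm, Spark, or MLeap APIs; every name in it (`commit_id`, `train_mean_model`, `train_with_provenance`, `audit`) is hypothetical, and a content hash simply stands in for a versioned data commit.

```python
import hashlib
import json


def commit_id(data: bytes) -> str:
    """Content hash standing in for a versioned data commit (assumption)."""
    return hashlib.sha256(data).hexdigest()


def train_mean_model(rows):
    """Stand-in for a real training job: the 'model' is the mean of the rows."""
    return {"mean": sum(rows) / len(rows)}


def train_with_provenance(raw: bytes):
    """Train a model and record which data version produced it."""
    rows = [float(x) for x in raw.decode().split()]
    model = train_mean_model(rows)
    record = {
        # Identifies the exact input data the model was trained on.
        "input_commit": commit_id(raw),
        # Identifies the exact model artifact that was produced.
        "model_commit": commit_id(json.dumps(model, sort_keys=True).encode()),
    }
    return model, record


def audit(record, raw: bytes) -> bool:
    """Check whether a dataset is the one recorded for this model."""
    return record["input_commit"] == commit_id(raw)


model, record = train_with_provenance(b"1.0 2.0 3.0")
```

Calling `audit(record, ...)` with the original bytes returns `True`, while any change to the data yields a different hash and the check fails. A real system would track this lineage automatically across every pipeline stage rather than in a hand-built record.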
Session hashtag: #SFds3
Daniel (@dwhitena) is a Ph.D.-trained data scientist working with Pachyderm (@pachydermIO). Daniel develops innovative, distributed data pipelines that include predictive models, data visualizations, statistical analyses, and more. He has spoken at conferences around the world (Datapalooza, DevFest Siberia, GopherCon, and more), teaches data science/engineering with Ardan Labs (@ardanlabs), maintains the Go kernel for Jupyter, and is actively helping to organize contributions to various open source data science projects.
Hollin Wilkins is a founder of Combust, an ML/AI start-up in the Bay Area. He has been working on machine learning infrastructure since 2015, focusing on platforms for data scientists and engineers to rapidly iterate on ML algorithms and pipeline deployments. Previously he worked in the games industry at Linden Lab on Blocksworld and Versu, helping to build everything from game UI to servers to custom logic languages that drive user experiences. He holds a degree in Biology from Cornell University and spends his free time hiking with his dog and snowboarding.