Something really exciting and largely unnoticed is going on in the Spark ecosystem. As data scientists and engineers learn Spark, they’re actually all implicitly learning a much older, more general topic: typed functional programming. While Spark itself was built on an accumulation of powerful computer science concepts from functional programming and other areas, developers are often encountering these ideas in the context of Spark for the first time. It turns out that Spark makes an excellent platform for learning concepts like immutability, higher order and anonymous functions, laziness, and monadic operators.
This talk will discuss how Spark can be used as teaching tool, to build skills in areas like typed functional programming. We’ll explore a skill-building curriculum that can be used with a data scientist or engineer who only has experience in imperative, dynamically-typed languages like Python. This curriculum introduces the core concepts of functional programming and type theory, while providing learners the opportunity to immediately apply their skills at massive scale, using the power of Spark’s painless scalability and resilience.
Based on the experience of building machine learning teams at x.ai and other data-centric startups, this curriculum is the foundation of building poly-skilled, highly autonomous team members who can build scalable intelligent systems. We’ll work from foundational concepts of Scala and functional programming towards a fully implemented machine learning pipeline, all using Spark and MLlib. Unique new features of Spark like Datasets and Structured Streaming will be particularly useful in this effort. Using this approach, teams can help members in all roles learn how to use sophisticated programming techniques that ensure correctness at scale. With these skills in their toolbox, data scientists and engineers often find that building powerful machine learning systems is intuitive, easy, and even fun.
Jeff Smith builds large-scale machine learning systems using Scala and Spark. For the past decade, he has been working on data science applications at various startups in New York, San Francisco, and Hong Kong. He’s a frequent blogger and the author of “Reactive Machine Learning Systems,” an upcoming book from Manning on how to build real-world machine learning systems using Scala, Akka, and Spark.
Rohan has used Spark to build enterprise and personal projects for the better part of a year. He is also the perfect case study for this talk as he used Spark to learn aspects of functional programming through Scala (which in turn landed him his current gig as a data engineer working on Scala applications at x.ai). In past lives, he has worked with other big data processing frameworks and received a Masters in Electrical and Computer Engineering from Carnegie Mellon University.