At Stitch Fix, data scientists work on a variety of applications, including style recommendation systems, natural language processing, demand modeling and forecasting, and inventory analysis and recommendations. They’ve used Apache Spark as a crucial part of the infrastructure to support these diverse capabilities, running a large number of jobs of varying sizes: typically 500-1,000 separate jobs each day, and often more.
Their scaling challenges center on the range of capabilities and the number of simultaneous jobs, rather than raw data size. They have a large team of around 80 data scientists who own their pipelines from start to finish; there is no “ETL engineering team” that takes over. As a result, Stitch Fix has developed a self-service approach to their infrastructure that makes it easy for the team to submit and track jobs, and they’ve grown an internal community of Spark users who help each other get started with Spark.
In this session, you’ll learn about Stitch Fix’s approach to using and managing Spark, including their infrastructure, execution service and other supporting tools. You’ll also hear about the types of capabilities they support with Spark, how the team transitioned to using Spark, and lessons learned along the way. Stitch Fix’s infrastructure uses many services from Amazon Web Services (AWS), tools from Netflix OSS, and several home-grown applications.
Session hashtag: #SFent4
Derek Bennett is the lead for the Platform Infrastructure team in the Algorithms group at Stitch Fix. He and his team develop and support the company’s Spark capabilities and its event logging infrastructure, which uses Amazon Kinesis and Apache Kafka, along with associated tools and applications that help make data available and usable. Derek holds a Ph.D. in Operations Research from UC Berkeley.