Data Science Deep Dive: Spark ML with High Dimensional Labels


This talk is a detailed extension of our Spark Summit East talk on the same topic. We will review the hurdles we faced and the workarounds we developed, with help from Databricks support, in our attempt to build a custom machine learning model and use it to predict TV ratings for different networks and demographics. Attendees should leave this session with enough knowledge to recognize situations where our method is applicable and to implement it.

Specifically, we dig into the data characteristics that make our problem inherently challenging and show how we compose existing tools in the ML and DataFrames APIs to create a machine learning pipeline capable of learning real-valued vector labels despite relatively low-dimensional feature spaces. Our deep dive will cover the feature engineering techniques employed, the reference architecture for our n-dimensional regression technique, and the extra data formatting steps required to apply the built-in evaluator tools to n-dimensional models.
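The abstract does not spell out the architecture, so the following is only a minimal sketch of one way the idea could look, written in plain Python rather than Spark. It assumes that an n-dimensional label is handled by fitting one scalar regressor per label dimension, and that vector predictions are then flattened into one (row, dimension, prediction, label) record per dimension so that a scalar metric such as RMSE can be applied. All function names and the toy data are hypothetical and are not taken from the talk.

```python
from math import sqrt
from statistics import mean


def fit_line(xs, ys):
    """Closed-form ordinary least squares for y = a + b * x (one feature)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def fit_per_dimension(xs, label_vectors):
    """Assumed approach: one independent scalar model per label dimension."""
    n_dims = len(label_vectors[0])
    return [fit_line(xs, [v[d] for v in label_vectors]) for d in range(n_dims)]


def predict(models, x):
    """Return the predicted label vector for a single feature value."""
    return [a + b * x for a, b in models]


def to_long_format(models, xs, label_vectors):
    """Flatten vector predictions into (row, dim, prediction, label) records."""
    rows = []
    for i, (x, labels) in enumerate(zip(xs, label_vectors)):
        for d, (pred, label) in enumerate(zip(predict(models, x), labels)):
            rows.append((i, d, pred, label))
    return rows


def rmse(rows):
    """Scalar RMSE over the flattened (prediction, label) pairs."""
    return sqrt(mean((pred - label) ** 2 for _, _, pred, label in rows))


# Toy data (hypothetical): one feature, two-dimensional labels.
xs = [0.0, 1.0, 2.0, 3.0]
label_vectors = [[1.0, 0.0], [3.0, -1.0], [5.0, -2.0], [7.0, -3.0]]

models = fit_per_dimension(xs, label_vectors)
long_rows = to_long_format(models, xs, label_vectors)
overall_rmse = rmse(long_rows)
```

The long format is the point of the sketch: once each label dimension is its own record with a scalar prediction and a scalar label, a metric designed for one-dimensional regression can score the whole n-dimensional model, or be grouped by dimension to score each output separately.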

Michael Zargham, Director, Data Science at Cadent

About Michael

Michael has a PhD in Optimization and Decision Science from the University of Pennsylvania, with a focus on constrained resource allocation problems. Michael leads the Data Science and Engineering initiatives at Cadent, a leading provider of media, advertising technology and data solutions for the pay-TV industry. He has also taught Convex Optimization at UPenn. He has been a practicing data-driven business architect since 2005, working on various subcontracts during his undergraduate and graduate work.

Stefan Panayotov, Data Engineer at Cadent

About Stefan

In his current role as a Data Engineer at Cadent, Stefan focuses on Big Data computational platforms such as Spark, which enable Cadent to apply Data Science and Machine Learning to achieve faster and better business results. Prior to Cadent, Stefan was an Application Developer at QVC, where he built logistics and warehouse software solutions for the retail industry. He’s also spent time as a SQL Developer at CCP, Senior Software Analyst at EXE Technologies, and an IT Consultant at UNISYS. Stefan received his PhD in Computer Science from the Bulgarian Academy of Sciences, where he also served as an Assistant Professor.