Using SparkML to Power a DSaaS (Data Science as a Service)

Almost every organization now needs data science, so the main challenge, once an algorithm has been chosen, is to scale it up and make it operational. At Comcast we use several tools and technologies, such as Python, R, SAS, and H2O.
In this talk we will show how many common use cases rely on the same common algorithms, such as Logistic Regression, Random Forest, Decision Trees, Clustering, and NLP.

Spark has several machine learning algorithms built in and scales very well. We at Comcast therefore built a platform that provides DSaaS on top of Spark, with a REST API for submitting and controlling jobs, so that most users are spared from writing (and rewriting) code and can focus on the actual requirements instead. We will show how we solved some of the problems of establishing feature vectors, choosing algorithms, and deploying models into production.
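As a rough illustration of the idea of controlling jobs through a REST API rather than hand-written code, the sketch below validates a JSON job request and resolves it to a Spark ML estimator class name. It is not the Comcast platform's API: the request fields (`algorithm`, `input_path`, `feature_columns`, `label_column`, `params`), the registry, and the function `parse_job_request` are all hypothetical names chosen for this example. Only the Spark ML class names are real.

```python
import json

# Hypothetical registry: maps the algorithm name in a job request to the
# Spark ML estimator class that a backend could instantiate.
ALGORITHMS = {
    "logistic_regression": "org.apache.spark.ml.classification.LogisticRegression",
    "random_forest": "org.apache.spark.ml.classification.RandomForestClassifier",
    "decision_tree": "org.apache.spark.ml.classification.DecisionTreeClassifier",
    "kmeans": "org.apache.spark.ml.clustering.KMeans",
}

# Hypothetical request schema: fields every job request must carry.
REQUIRED_FIELDS = ("algorithm", "input_path", "feature_columns")


def parse_job_request(body: str) -> dict:
    """Validate a JSON job request and resolve it to a job description."""
    request = json.loads(body)

    missing = [f for f in REQUIRED_FIELDS if f not in request]
    if missing:
        raise ValueError(f"missing fields: {', '.join(missing)}")

    name = request["algorithm"]
    if name not in ALGORITHMS:
        raise ValueError(f"unsupported algorithm: {name}")

    if not request["feature_columns"]:
        raise ValueError("feature_columns must not be empty")

    return {
        "estimator_class": ALGORITHMS[name],
        "input_path": request["input_path"],
        "feature_columns": list(request["feature_columns"]),
        # Unsupervised jobs such as clustering carry no label column.
        "label_column": request.get("label_column"),
        "params": request.get("params", {}),
    }
```

A user of such a service would then describe a job declaratively, for example `{"algorithm": "kmeans", "input_path": "/data/x", "feature_columns": ["a", "b"]}`, and leave the feature-vector assembly and model training to the backend.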

We will showcase our use of Scala, R, and Python to implement models in the language of choice while still deploying them quickly into production on 500-node Spark clusters.

About Shekhar

Shekhar Agrawal works as the Director, Data Science at Comcast, where he manages a team of data scientists working on sophisticated data science applications and modeling at PB scale using several platforms, including Spark.

Sridhar Alla, Director of Big Data Solutions at Comcast

About Sridhar

Sridhar Alla currently works as the Director of Big Data Solutions and Architecture at Comcast, where he has delivered several key solutions, such as the XFinity personalization platform, ClickthruAnalytics, and the Correlation platform. Sridhar started his career in network appliances, working on NAS and caching technologies. He also served as the CTO of security company eIQNetworks, where he merged the concepts of big data and security products. He holds patents on very-large-scale processing algorithms and caching.