Nonlinear methods are widely used because they often achieve better predictive performance than linear methods; however, they are generally more expensive in model size, training time, and scoring time. With proper feature engineering techniques such as polynomial expansion, linear methods can be competitive with nonlinear ones. Mapping the data to a higher-dimensional space, however, makes linear methods prone to overfitting and unstable coefficients, which can be addressed by penalization methods such as Lasso and Elastic-Net. Finally, we’ll show how to train linear models with Elastic-Net regularization using MLlib.
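To make the idea concrete, here is a minimal pure-Python sketch of the two ingredients mentioned above: polynomial expansion of the features, and a linear fit with an Elastic-Net penalty. It is an illustration under stated assumptions, not MLlib code. The function names (`polynomial_expand`, `soft_threshold`, `fit_elastic_net`) are hypothetical, the solver is a plain coordinate-descent loop chosen for brevity, and there is no intercept or feature standardization. The parameter names `reg_param` and `elastic_net_param` are borrowed from MLlib's `regParam` and `elasticNetParam`, and the penalty is assumed to take the form `reg_param * (elastic_net_param * L1 + (1 - elastic_net_param) / 2 * L2^2)`.

```python
from itertools import combinations_with_replacement
from math import prod


def polynomial_expand(x, degree=2):
    """Return all monomials of the features in x with total degree 1..degree.

    The ordering here (by degree) is an assumption of this sketch and is not
    meant to match any particular library's ordering.
    """
    expanded = []
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(len(x)), d):
            expanded.append(prod(x[i] for i in idx))
    return expanded


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; the L1 part of the update."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def fit_elastic_net(rows, targets, reg_param, elastic_net_param, n_iter=100):
    """Coordinate descent for least squares with an Elastic-Net penalty.

    Minimizes (1 / 2n) * sum((y - x.w)^2)
              + reg_param * (elastic_net_param * |w|_1
                             + (1 - elastic_net_param) / 2 * |w|_2^2).
    No intercept and no standardization (simplifying assumptions).
    """
    n, p = len(rows), len(rows[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                r[j] * (y - sum(r[k] * w[k] for k in range(p) if k != j))
                for r, y in zip(rows, targets)
            ) / n
            z = sum(r[j] ** 2 for r in rows) / n
            denom = z + reg_param * (1.0 - elastic_net_param)
            if denom == 0.0:
                continue  # constant-zero feature with no L2 term: leave w[j] as is
            w[j] = soft_threshold(rho, reg_param * elastic_net_param) / denom
    return w


# Toy data: y = x^2, which a linear model can fit only after expansion.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
rows = [polynomial_expand([x], degree=2) for x in xs]  # columns: x, x^2
targets = [x * x for x in xs]

weights = fit_elastic_net(rows, targets, reg_param=0.5, elastic_net_param=0.5)
print(weights)
```

With `reg_param=0.0` the fit recovers the unpenalized coefficients on this toy data, while a positive `reg_param` shrinks the coefficient on `x^2` below 1. Setting `elastic_net_param=1.0` gives a pure Lasso penalty, and a large enough `reg_param` then drives every coefficient to exactly zero.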
DB Tsai is an Apache Spark committer and a Senior Research Engineer at Netflix. He has recently been working with the Apache Spark community to add several new algorithms, including Linear Regression and Binary Logistic Regression with Elastic-Net (L1/L2) regularization, Multinomial Logistic Regression, and the LBFGS optimizer. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he developed innovative large-scale distributed linear algorithms and then contributed them back to the open-source Apache Spark project.
Steven Hillion has been leading large engineering and analytics projects for fifteen years. Before joining Alpine Data Labs, he founded the data science team at Greenplum and also led various open-source projects in analytics and machine learning. Before that, at Kana and M-Factor, he built analytics applications and innovated in the area of econometric modeling. At Scopus Technology (later Siebel Systems), he co-founded development groups for finance, telecom, and other verticals. He has a background in pure mathematics, having earned a Ph.D. in analytic number theory from the University of California at Berkeley.