Large-Scaled Insurance Analytics Using Tweedie Models in Apache Spark


An emerging trend in automobile insurance is to charge drivers based on how much they drive, as well as how, where and when they drive. Data scientists need to analyze trip-level indicators that affect auto accidents, namely characteristics specific to a driver going from point A to point B. This detailed level of analysis presents several new challenges. First, collecting detailed trip-level data produces huge volumes of data that must be processed efficiently. Second, automobile accidents are low-frequency events, so a large portion of trips have zero insurance loss. Third, driver-level variation strongly influences statistical inference, but modeling a large number of driver-level effects leads to a high-dimensional estimation problem.

Learn how Uber uses a novel implementation of the Tweedie mixed-effect model on the Spark distributed computing engine to address these challenges. The Tweedie compound Poisson distribution is employed to model zero-inflated insurance losses, and random effects are used to account for the inherent heterogeneity of drivers. This implementation on Spark enables efficient processing of high volumes of trip-level data and fast estimation of high-dimensional models.
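To illustrate why the Tweedie compound Poisson distribution suits losses that are mostly zero, here is a minimal simulation sketch. It is not Uber's implementation and does not use Spark. It assumes the standard compound Poisson-gamma representation for a power parameter p between 1 and 2: the number of claims is Poisson, each claim size is gamma, and a trip with no claims has a loss of exactly zero. The function names and the values of mu, phi and p are hypothetical, chosen only for illustration.

```python
import math
import random


def tweedie_params(mu, phi, p):
    """Map Tweedie (mean mu, dispersion phi, power 1 < p < 2)
    to compound Poisson-gamma parameters."""
    lam = mu ** (2 - p) / (phi * (2 - p))   # Poisson rate: expected claim count
    shape = (2 - p) / (p - 1)               # gamma shape of each claim size
    scale = phi * (p - 1) * mu ** (p - 1)   # gamma scale of each claim size
    return lam, shape, scale


def sample_poisson(lam, rng):
    """Draw a Poisson count with Knuth's method (adequate for small rates)."""
    threshold = math.exp(-lam)
    count, prod = 0, rng.random()
    while prod > threshold:
        count += 1
        prod *= rng.random()
    return count


def sample_tweedie(mu, phi, p, rng):
    """Draw one loss: the sum of a Poisson number of gamma claim sizes."""
    lam, shape, scale = tweedie_params(mu, phi, p)
    n_claims = sample_poisson(lam, rng)
    return sum(rng.gammavariate(shape, scale) for _ in range(n_claims))


# Hypothetical per-trip values, not taken from the talk.
mu, phi, p = 0.5, 20.0, 1.5
rng = random.Random(0)
losses = [sample_tweedie(mu, phi, p, rng) for _ in range(20000)]

lam, _, _ = tweedie_params(mu, phi, p)
zero_share = sum(1 for y in losses if y == 0) / len(losses)
mean_loss = sum(losses) / len(losses)

print(f"theoretical P(loss = 0): {math.exp(-lam):.3f}")
print(f"simulated share of zero losses: {zero_share:.3f}")
print(f"simulated mean loss: {mean_loss:.3f} (target mu = {mu})")
```

Under these assumptions the distribution places probability exp(-lambda) on exactly zero while keeping a continuous, right-skewed distribution for positive losses, so a single model covers both the many loss-free trips and the occasional claim. The random effects described above would enter through the mean mu, for example as a driver-specific term in a log-link linear predictor; that part is not sketched here.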

Hear about the challenges of estimating mixed-effects models in a distributed environment, and the details of Uber’s implementation.

Session hashtag: #SFds7

Yanwei Zhang, Senior Data Scientist at Uber

About Yanwei

Yanwei (Wayne) Zhang is a senior data scientist at Uber Technologies Inc. He has a Master’s degree in statistics and a PhD in quantitative marketing. He has published several research papers in top journals in statistics and actuarial science. His interest is in large-scale machine learning, with a focus on applications in insurance and actuarial science.