An emerging trend in automobile insurance is to charge drivers based on how much they drive, as well as how, where and when they drive. Data scientists need to analyze the trip-level indicators that affect auto accidents, that is, the characteristics specific to a driver going from point A to point B. This level of detail presents several new challenges. First, collecting trip-level data produces huge volumes of data that must be processed efficiently. Second, automobile accidents are low-frequency events, so a large portion of trips have zero insurance loss. Third, driver-level variation strongly influences statistical inference, but modeling a large number of driver-level effects leads to a high-dimensional estimation problem.
Learn how Uber uses a novel implementation of the Tweedie mixed-effect model on the Spark distributed computing engine to address these challenges. The Tweedie compound Poisson distribution is employed to model zero-inflated insurance losses, and random effects are used to account for the inherent heterogeneity of drivers. This implementation on Spark enables efficient processing of high volumes of trip-level data and fast estimation of high-dimensional models.
Hear about the challenges of estimating mixed-effects models in a distributed environment, and the details of Uber’s implementation.
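To make the zero-heavy loss structure described above concrete, here is a minimal simulation sketch. It is not Uber's implementation and does not use Spark. It assumes a compound Poisson-gamma loss per trip (a Poisson number of claims, each with a gamma-distributed severity) and a normally distributed driver-level random effect on the log claim frequency. The function name `simulate_trip_losses` and all parameter values are hypothetical, chosen only for illustration.

```python
import numpy as np

def simulate_trip_losses(n_drivers, trips_per_driver, lam=0.1,
                         shape=2.0, scale=500.0, sigma=0.3, seed=0):
    """Simulate trip-level losses from a compound Poisson-gamma model.

    All names and parameter values are illustrative assumptions,
    not taken from the talk.
    """
    rng = np.random.default_rng(seed)

    # One random effect per driver, acting on the log claim frequency
    # (an assumed specification of driver heterogeneity).
    driver_effect = rng.normal(0.0, sigma, size=n_drivers)

    # Driver index for every trip.
    driver = np.repeat(np.arange(n_drivers), trips_per_driver)

    # Trip-level claim counts: Poisson with a driver-specific rate.
    rate = lam * np.exp(driver_effect[driver])
    n_claims = rng.poisson(rate)

    # The sum of k independent Gamma(shape, scale) severities is
    # Gamma(k * shape, scale); trips with no claims keep a loss of zero.
    losses = np.zeros(driver.size)
    has_claim = n_claims > 0
    losses[has_claim] = rng.gamma(shape * n_claims[has_claim], scale)
    return driver, losses

if __name__ == "__main__":
    driver, losses = simulate_trip_losses(n_drivers=1000, trips_per_driver=100)
    print("share of trips with zero loss:", np.mean(losses == 0.0))
    print("mean loss per trip:", losses.mean())
```

With a low claim rate, most simulated trips carry exactly zero loss while the remainder follow a continuous, right-skewed distribution, which is the mixed discrete-continuous shape a Tweedie compound Poisson model is meant to capture. Estimating one effect per driver is what makes the real problem high-dimensional.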
Session hashtag: #SFds7
Yanwei (Wayne) Zhang is a senior data scientist at Uber Technologies Inc. He has a Master’s degree in statistics and a PhD in quantitative marketing. He has published several research papers in top journals in statistics and actuarial science. His interest is in large-scale machine learning, with a focus on applications in insurance and actuarial science.