Dynamic Community Detection for Large-scale e-Commerce data with Spark Streaming and GraphX


Most production graph data at Alibaba is very large, especially for complex network problems. This brings an urgent challenge: more efficient community detection is needed for rapidly changing networks. It is preferable for the community model to be updated dynamically from real-time data streams rather than once a day. In this way, better prediction of customer behavior and more timely risk control can be achieved.

In our work, we propose Dynamic Community Detection, a hybrid process model that takes full advantage of Spark by combining online incremental community detection using Spark Streaming with an offline daily rebuild using Spark GraphX. Results on production graph data demonstrate that Dynamic Community Detection maintains continuously stable, high-quality modularity. Meanwhile, Dynamic Community Detection is much faster than offline algorithms and shows great potential in areas such as fraud detection and marketing strategy.
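To make the quality measure concrete, here is a minimal sketch in plain Python (not Spark) of the standard modularity score for an undirected, unweighted graph, together with a naive incremental step that places a newly arrived node into whichever neighboring community yields the highest modularity. This is an illustration only, not the algorithm from the talk: the function names `modularity` and `assign_new_node`, the toy graph, and the greedy placement rule are all assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, community_of):
    """Standard modularity Q for an undirected, unweighted edge list.

    Q = sum over communities c of  L_c / m - (d_c / (2m))^2
    where m is the number of edges, L_c the number of edges inside c,
    and d_c the sum of degrees of the nodes in c.
    """
    m = len(edges)
    if m == 0:
        return 0.0
    intra = defaultdict(int)       # L_c: edges with both endpoints in c
    degree_sum = defaultdict(int)  # d_c: total degree of nodes in c
    for u, v in edges:
        cu, cv = community_of[u], community_of[v]
        degree_sum[cu] += 1
        degree_sum[cv] += 1
        if cu == cv:
            intra[cu] += 1
    return sum(intra[c] / m - (degree_sum[c] / (2 * m)) ** 2
               for c in degree_sum)


def assign_new_node(edges, community_of, node, new_edges):
    """Hypothetical greedy incremental step (an assumption, not the talk's method).

    Tries each community adjacent to the new node, plus a fresh singleton
    community, and returns the label that gives the highest modularity.
    """
    all_edges = list(edges) + list(new_edges)
    candidates = {community_of[v]
                  for edge in new_edges for v in edge
                  if v != node and v in community_of}
    candidates.add(("singleton", node))
    return max(candidates,
               key=lambda c: modularity(all_edges, {**community_of, node: c}))


# Toy graph: two triangles joined by a single bridge edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
communities = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}

q = modularity(edges, communities)  # 5/14 for this partition

# A new node 6 arrives in the stream with edges to nodes 3 and 4.
label = assign_new_node(edges, communities, 6, [(6, 3), (6, 4)])
```

Recomputing modularity from scratch for every candidate, as done here, is only workable on toy data. A streaming system would instead maintain the per-community edge and degree counts incrementally and fall back on a periodic full rebuild to correct accumulated drift, which is the role the offline daily rebuild plays in the hybrid model described above.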


About Ming

Ming Huang is the technical leader of the Machine Learning and Artificial Intelligence team at Taobao, Alibaba Group. He is one of the earliest Spark researchers and evangelists in China. His team brought China’s first Spark on YARN cluster into production in 2013. He now focuses on large-scale machine learning with MLlib and GraphX. His data science team has built many recommendation and prediction models on Spark to solve Alibaba’s e-commerce business problems.