A Scalable Hierarchical Clustering Algorithm Using Spark

Slides PDF Video

Clustering is often an essential first step in datamining intended to reduce redundancy, or define data categories. Hierarchical clustering, a widely used clustering technique, can
offer a richer representation by suggesting the potential group
structures. However, parallelization of such an algorithm is challenging as it exhibits inherent data dependency during the hierarchical tree construction. In this paper, we design a
parallel implementation of Single-linkage Hierarchical Clustering by formulating it as a Minimum Spanning Tree problem. We further show that Spark is a natural fit for the parallelization of
single-linkage clustering algorithm due to its natural expression
of iterative process. Our algorithm can be deployed easily in
Amazon’s cloud environment. And a thorough performance
evaluation in Amazon’s EC2 verifies that the scalability of our
algorithm sustains when the datasets scale up.

Chen Jin, Software Engineer at Uber

About Chen

Chen has been a software engineer at Uber since 2015, where she has built several active services (such as streamio, eater-surge service). Prior to joining Uber, she worked on BigTable product at Palantir, an interactive large-scale data analytical platform used by several influential clients then. Before that, she spent 4 years in PhD program at Northwestern University and focused her researching on scaling data mining in big data.