A Fast Granular-Ball-Based Density Peaks Clustering Algorithm for Large-Scale Data

Bibliographic Details
Published in IEEE Transactions on Neural Networks and Learning Systems, Vol. 35, no. 12, pp. 17202-17215
Main Authors Cheng, Dongdong; Li, Ya; Xia, Shuyin; Wang, Guoyin; Huang, Jinlong; Zhang, Sulan
Format Journal Article
Language English
Published United States IEEE 01.12.2024
ISSN 2162-237X
2162-2388
DOI 10.1109/TNNLS.2023.3300916

Summary: The density peaks clustering algorithm (DP) has difficulty clustering large-scale data because it requires the distance matrix to compute the density and $\delta$-distance of each object, which takes $O(n^{2})$ time. A granular ball (GB) is a coarse-grained representation of data. It rests on the observation that an object and its local neighbors have similar distributions and are highly likely to belong to the same class. Xia et al. introduced GBs into supervised learning to improve the efficiency of methods such as support vector machines, $k$-nearest neighbor classification, and rough sets. Inspired by the idea of GB, we introduce it into unsupervised learning for the first time and propose a GB-based DP algorithm, called GB-DP. First, it generates GBs from the original data with an unsupervised partitioning method. Then, it defines the density of GBs, instead of the density of objects, according to each GB's center, radius, and the distances between its members and its center, without setting any parameters. After that, it takes the distance between the centers of two GBs as the distance between those GBs and defines the $\delta$-distance of GBs. Finally, it uses the GBs' density and $\delta$-distance to plot the decision graph, employs the DP algorithm to cluster them, and extends the clustering result to the original data. Since there is no need to calculate the distance between every pair of objects and the number of GBs is far smaller than the number of objects, it greatly reduces the running time of the DP algorithm. Compared with $k$-means, ball $k$-means, DP, DPC-KNN-PCA, FastDPeak, and DLORE-DP, GB-DP obtains similar or even better clustering results in much less running time without setting any parameters. The source code is available at https://github.com/DongdongCheng/GB-DP.
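
The four steps in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation (see the repository linked above for that), and several details are assumptions made only for illustration: the function names `split_into_balls` and `gb_dp_sketch`, the farthest-point 2-way split, the `max_size` cap of roughly sqrt(n) points per ball, the density formula (member count divided by the summed member-to-center distance), and the use of a caller-supplied `n_clusters` in place of reading the decision graph.

```python
import numpy as np

def split_into_balls(X, max_size):
    """Recursively 2-way split index sets until each holds <= max_size points.
    (Assumed partitioning rule; the paper's own method may differ.)"""
    stack, balls = [np.arange(len(X))], []
    while stack:
        idx = stack.pop()
        if len(idx) <= max_size:
            balls.append(idx)
            continue
        pts = X[idx]
        # Seeds: the point farthest from the mean, then the point farthest from it.
        a = pts[np.argmax(np.linalg.norm(pts - pts.mean(axis=0), axis=1))]
        b = pts[np.argmax(np.linalg.norm(pts - a, axis=1))]
        near_a = np.linalg.norm(pts - a, axis=1) <= np.linalg.norm(pts - b, axis=1)
        if near_a.all():          # all points identical, cannot split further
            balls.append(idx)
            continue
        stack.extend([idx[near_a], idx[~near_a]])
    return balls

def gb_dp_sketch(X, n_clusters, max_size=None):
    X = np.asarray(X, dtype=float)
    if max_size is None:
        max_size = max(2, int(np.sqrt(len(X))))   # assumed cap, not from the paper
    balls = split_into_balls(X, max_size)
    m = len(balls)
    centers = np.array([X[idx].mean(axis=0) for idx in balls])

    # Assumed ball density: member count / summed member-to-center distance.
    rho = np.zeros(m)
    for i, idx in enumerate(balls):
        total = np.linalg.norm(X[idx] - centers[i], axis=1).sum()
        rho[i] = len(idx) / total if total > 0 else 0.0

    # Distances between ball centers only: an m x m matrix instead of n x n.
    dist = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)

    # delta-distance: distance to the nearest ball that ranks higher in density.
    order = np.argsort(-rho, kind="stable")       # densest ball first
    delta = np.empty(m)
    parent = np.full(m, -1)
    delta[order[0]] = dist.max()
    for rank in range(1, m):
        i, higher = order[rank], order[:rank]
        j = higher[np.argmin(dist[i, higher])]
        delta[i], parent[i] = dist[i, j], j

    # Stand-in for the decision graph: take the balls with the largest
    # rho * delta as peaks; the densest ball is always kept as a peak.
    gamma = rho * delta
    gamma[order[0]] = np.inf
    peaks = np.argsort(-gamma, kind="stable")[:n_clusters]

    # Standard DP assignment: each remaining ball joins the cluster of its parent.
    ball_label = np.full(m, -1)
    ball_label[peaks] = np.arange(len(peaks))
    for i in order:
        if ball_label[i] < 0:
            ball_label[i] = ball_label[parent[i]]

    # Expand the ball labels to the original objects.
    labels = np.empty(len(X), dtype=int)
    for i, idx in enumerate(balls):
        labels[idx] = ball_label[i]
    return labels
```

Calling `gb_dp_sketch(X, n_clusters=3)` on an `(n, d)` array returns one integer label per row. The point of the sketch is the cost structure: the pairwise matrix is built over ball centers only, so with about sqrt(n) points per ball it has far fewer entries than the full object-to-object distance matrix that plain DP needs.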