Improving Scalable K-Means
| Published in | Algorithms Vol. 14; no. 1; p. 6 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Basel: MDPI AG, 01.01.2021 |
| ISSN | 1999-4893 |
| DOI | 10.3390/a14010006 |
| Summary: | Two new initialization methods for K-means clustering are proposed. Both are based on applying a divide-and-conquer approach to the K-means‖ type of initialization strategy. The second proposal also uses multiple lower-dimensional subspaces, produced by the random projection method, for the initialization. The proposed methods are scalable and can be run in parallel, which makes them suitable for initializing large-scale problems. In the experiments, the proposed methods are compared to the K-means++ and K-means‖ methods on an extensive set of reference and synthetic large-scale datasets. For the latter, a novel high-dimensional clustering data generation algorithm is given. The experiments show that the proposed methods compare favorably to the state of the art, improving both clustering accuracy and the speed of convergence. We also observe that the currently most popular K-means++ initialization behaves like random initialization in very high-dimensional cases. |
|---|---|
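For context on the baselines named in the summary, K-means++ seeds each new cluster center by sampling a data point with probability proportional to its squared distance from the nearest already-chosen center (K-means‖ is its scalable, oversampling variant). The sketch below is an illustrative NumPy implementation of standard K-means++ seeding, not the authors' code; the function name `kmeans_pp_init` and its arguments are our own.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=None):
    """K-means++ seeding sketch: the first center is drawn uniformly at
    random; each subsequent center is a data point sampled with probability
    proportional to its squared distance to the nearest chosen center."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = [X[rng.integers(n)]]  # first center: uniform random point
    for _ in range(k - 1):
        C = np.asarray(centers)
        # squared distance of every point to its nearest chosen center
        d2 = np.min(((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1), axis=1)
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(n, p=probs)])
    return np.asarray(centers)
```

In very high dimensions, pairwise squared distances tend to concentrate, so the sampling distribution `probs` flattens toward uniform, which is consistent with the paper's observation that K-means++ then behaves like random initialization.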