Stability estimation for unsupervised clustering: A review

Bibliographic Details
Published in Wiley interdisciplinary reviews. Computational statistics Vol. 14; no. 6; p. e1575
Main Authors Liu, Tianmou; Yu, Han; Blair, Rachael Hageman
Format Journal Article
Language English
Published Hoboken, USA John Wiley & Sons, Inc 01.11.2022
ISSN 1939-5108
1939-0068
DOI 10.1002/wics.1575

Summary: Cluster analysis remains one of the most challenging yet fundamental tasks in unsupervised learning. This is due in part to the fact that there are no labels or gold standards by which performance can be measured. Moreover, the wide range of clustering methods available is governed by different objective functions, different parameters, and dissimilarity measures. The purpose of clustering is versatile, often playing critical roles in the early stages of exploratory data analysis and as an endpoint for knowledge and discovery. Thus, understanding the quality of a clustering is of critical importance. The concept of stability has emerged as a strategy for assessing the performance and reproducibility of data clustering. The key idea is to produce perturbed data sets that are very close to the original, and cluster them. If the clustering is stable, then the clusters from the original data will be preserved in the perturbed data clustering. The nature of the perturbation, and the methods for quantifying similarity between clusterings, are nontrivial, and are ultimately what sets many of the stability estimation methods apart. In this review, we provide an overview of the very active research area of cluster stability estimation and discuss some of the open questions and challenges that remain in the field.
This article is categorized under: Statistical Learning and Exploratory Methods of the Data Sciences > Clustering and Classification
Grouping items into clusters is a complex problem in unsupervised learning with inherent uncertainty. Stability is a measurement that characterizes the strength and reproducibility of a cluster and an item's membership in a cluster.
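The perturb, recluster, and compare idea in the summary can be illustrated with a minimal sketch. It is not a method taken from the review: it assumes Gaussian jitter as the perturbation, a plain Lloyd's k-means as the clustering method, and the unadjusted Rand index as the similarity measure. The names kmeans, rand_index, stability, and two_blobs, and the parameters noise and n_perturb, are hypothetical and chosen only for this illustration.

```python
import random
from itertools import combinations


def kmeans(points, k, rng, iters=100):
    """Plain Lloyd's k-means on 2-D tuples; returns one cluster label per point."""
    centers = rng.sample(points, k)  # initialise from k distinct data points
    labels = []
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        labels = [
            min(range(k), key=lambda c: (x - centers[c][0]) ** 2 + (y - centers[c][1]) ** 2)
            for x, y in points
        ]
        # Update step: move each centre to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append((
                    sum(m[0] for m in members) / len(members),
                    sum(m[1] for m in members) / len(members),
                ))
            else:
                new_centers.append(centers[c])  # keep the centre of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def rand_index(a, b):
    """Fraction of item pairs on which two labelings agree (same vs. different cluster)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def stability(points, k, n_perturb=20, noise=0.05, seed=0):
    """Mean Rand index between the original clustering and clusterings of jittered copies."""
    rng = random.Random(seed)
    base = kmeans(points, k, rng)
    scores = []
    for _ in range(n_perturb):
        # Assumed perturbation: small Gaussian jitter, so items keep their positions in the list.
        perturbed = [(x + rng.gauss(0, noise), y + rng.gauss(0, noise)) for x, y in points]
        scores.append(rand_index(base, kmeans(perturbed, k, rng)))
    return sum(scores) / len(scores)


def two_blobs(n_per_blob=20, seed=1):
    """Toy data for the demo: two well-separated Gaussian blobs."""
    rng = random.Random(seed)
    blob_a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(n_per_blob)]
    blob_b = [(rng.gauss(10, 0.5), rng.gauss(10, 0.5)) for _ in range(n_per_blob)]
    return blob_a + blob_b


if __name__ == "__main__":
    print(stability(two_blobs(), k=2))
```

A score near 1 suggests that the partition survives the perturbation, while lower scores suggest that it depends on small details of the data. Jitter is used here because it keeps every item in both clusterings, so labels can be compared index by index. Resampling-based perturbations would instead need the comparison restricted to the items that the two samples share.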
Bibliography:
Edited by Nicole Lazar, Commissioning Editor and David Scott, Co-Editor-in-Chief
Funding Information: Rachael Hageman Blair was supported by the NSF DMS 1557589. Han Yu was supported by the National Cancer Institute Cancer Center Support (Grant P30CA016056) and National Cancer Institute IOTN Moonshot (Grant U24CA232979).
Correction added on 21 January 2022, after first online publication: The copyright line was changed.