Silhouette width using generalized mean—A flexible method for assessing clustering efficiency
Published in | Ecology and evolution Vol. 9; no. 23; pp. 13231 - 13243 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | England: John Wiley & Sons, Inc, 01.12.2019 |
ISSN | 2045-7758 |
DOI | 10.1002/ece3.5774 |
Summary: | Cluster analysis plays a vital role in pattern recognition in several fields of science. Silhouette width is a widely used index for assessing the fit of individual objects in the classification, as well as the quality of clusters and of the entire classification. Silhouette width combines two clustering criteria, compactness and separation, which implies that spherical cluster shapes are preferred over others. This property can be a disadvantage in the presence of complex, nonspherical clusters, which are common in real situations. We suggest a generalization of the silhouette width using the generalized mean. By changing the p parameter of the generalized mean between −∞ and +∞, several specific summary statistics, including the minimum, the maximum, and the arithmetic, harmonic, and geometric means, can be reproduced. Implementing the generalized mean in the calculation of silhouette width allows for changing the sensitivity of the index to compactness versus connectedness. With higher sensitivity to connectedness, the preference of silhouette width toward spherical clusters should decrease. We test the performance of the generalized silhouette width on artificial data sets and on the Iris data set. We examine how classifications with different numbers of clusters, prepared by different algorithms, are evaluated when p is set to different values. When p was negative, well‐separated clusters achieved high silhouette widths despite their elongated or circular shapes. Positive values of p increased the importance of compactness; hence, the preference toward spherical clusters became even more pronounced. With low p, single linkage clustering was deemed the most efficient clustering method, while with higher parameter values the performance of group average, complete linkage, and beta flexible with beta = −0.25 seemed better.
The generalized silhouette allows for adjusting the contribution of compactness and connectedness criteria, thus avoiding underestimation of clustering efficiency in the presence of clusters with high internal heterogeneity.
Silhouette width is widely used for assessing clustering efficiency in various scientific fields including molecular biology, ecology, and taxonomy. We present a generalization of this index which allows for adjusting its sensitivity to cluster compactness and connectedness. With the proposed flexible formula, it is possible to evaluate clusters of different shape, size, and internal variation more reliably. |
---|---|
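
The summary describes the generalized mean and its use in silhouette width only in words, so a small illustration may help. The following is a minimal Python sketch, not the authors' implementation. It assumes that the generalized mean simply replaces the arithmetic mean in both the within-cluster term a(i) and the nearest-other-cluster term b(i) of the classical silhouette formula s(i) = (b(i) − a(i)) / max(a(i), b(i)). The function names `generalized_mean` and `generalized_silhouette`, the zero-width convention for singleton clusters, and the toy data are all hypothetical choices made for this example.

```python
import math


def generalized_mean(values, p):
    """Generalized (power) mean M_p of strictly positive values.

    p = -inf gives the minimum, p = -1 the harmonic mean, p = 0 the
    geometric mean (as a limit), p = 1 the arithmetic mean, and
    p = +inf the maximum.
    """
    xs = [float(v) for v in values]
    if not xs or min(xs) <= 0:
        raise ValueError("values must be non-empty and strictly positive")
    if p == -math.inf:
        return min(xs)
    if p == math.inf:
        return max(xs)
    if p == 0:
        # Limit of M_p as p -> 0: the geometric mean.
        return math.exp(sum(math.log(x) for x in xs) / len(xs))
    return (sum(x ** p for x in xs) / len(xs)) ** (1.0 / p)


def generalized_silhouette(dist, labels, p=1.0):
    """Per-object silhouette widths with M_p replacing the arithmetic mean.

    Assumptions of this sketch: `dist` is a full square matrix of
    distances, there are at least two clusters, distinct objects have
    strictly positive distances, and singleton clusters get width 0.
    """
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)

    widths = []
    for i, own_label in enumerate(labels):
        own = [j for j in clusters[own_label] if j != i]
        if not own:
            widths.append(0.0)  # assumed convention for singletons
            continue
        # a: generalized mean distance to the other members of i's cluster
        a = generalized_mean([dist[i][j] for j in own], p)
        # b: smallest generalized mean distance to any other cluster
        b = min(
            generalized_mean([dist[i][j] for j in members], p)
            for lab, members in clusters.items()
            if lab != own_label
        )
        widths.append((b - a) / max(a, b))
    return widths


# Toy data: an elongated cluster (0, 1, 2, 3) and a distant pair (10, 11).
points = [0.0, 1.0, 2.0, 3.0, 10.0, 11.0]
labels = [0, 0, 0, 0, 1, 1]
dist = [[abs(u - v) for v in points] for u in points]

sil_arithmetic = generalized_silhouette(dist, labels, p=1.0)
sil_minimum = generalized_silhouette(dist, labels, p=-math.inf)
```

On this toy data the end point of the elongated cluster receives a higher width under p = −∞ than under p = 1, which agrees with the summary's observation that negative p is more tolerant of elongated but well-separated clusters. With p = 1 the sketch reduces to the classical silhouette width.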