Cluster Quality Analysis Using Silhouette Score

Bibliographic Details
Published in	2020 IEEE 7th International Conference on Data Science and Advanced Analytics (DSAA) pp. 747 - 748
Main Authors	Shahapure, Ketan Rajshekhar; Nicholas, Charles
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2020
Subjects	Benchmark testing; Clustering algorithms; Computer science; Electrical engineering; Inspection; Iris; Writing
DOI	10.1109/DSAA49011.2020.00096

Summary:	Clustering is an important phase in data mining. Selecting the number of clusters in a clustering algorithm, e.g. choosing the best value of k in the various k-means algorithms [1], can be difficult. We studied the use of silhouette scores and scatter plots to suggest, and then validate, the number of clusters we specified in running the k-means clustering algorithm on two publicly available data sets. Scikit-learn's [4] silhouette score method, which is a measure of the quality of a cluster, was used to find the mean silhouette coefficient of all the samples for different numbers of clusters. The highest silhouette score indicates the optimal number of clusters. We present several instances of using the silhouette score to determine the best value of k for those data sets.
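The procedure the summary describes (run k-means for several values of k, compute the mean silhouette coefficient for each, and prefer the k with the highest score) might be sketched as follows. This is a minimal illustration, not the authors' code: the helper name silhouette_by_k, the candidate range of k, and the n_init and random_state settings are assumptions made for the example. The Iris data is used only because it ships with scikit-learn and appears among the subject terms above; this record does not name the two data sets.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score


def silhouette_by_k(X, k_values, random_state=0):
    """Hypothetical helper: map each k to the mean silhouette coefficient
    of a k-means clustering of X with that many clusters."""
    scores = {}
    for k in k_values:
        # n_init and random_state are illustrative choices, not taken from the paper.
        model = KMeans(n_clusters=k, n_init=10, random_state=random_state)
        labels = model.fit_predict(X)
        # silhouette_score returns the mean silhouette coefficient over all samples.
        scores[k] = silhouette_score(X, labels)
    return scores


if __name__ == "__main__":
    X = load_iris().data
    scores = silhouette_by_k(X, range(2, 7))  # silhouette needs at least 2 clusters
    for k, score in scores.items():
        print(f"k={k}: mean silhouette = {score:.3f}")
    best_k = max(scores, key=scores.get)
    print("k with the highest mean silhouette score:", best_k)
```

The silhouette coefficient is undefined for a single cluster, so the candidate range starts at k = 2. As the summary notes, the score is meant to be read alongside scatter plots; the highest-scoring k is a suggestion to validate, not a final answer.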