Extending k-Means-Based Algorithms for Evolving Data Streams with Variable Number of Clusters

Many algorithms for clustering data streams based on the widely used k-Means have been proposed in the literature. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, we...

Full description

Saved in:
Bibliographic Details
Published in2011 Tenth International Conference on Machine Learning and Applications Vol. 2; pp. 14 - 19
Main Authors de Andrade Silva, J., Hruschka, E. R.
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.12.2011
Subjects
Online AccessGet full text
ISBN9781457721342
1457721341
DOI10.1109/ICMLA.2011.67

Cover

More Information
Summary:Many algorithms for clustering data streams based on the widely used k-Means have been proposed in the literature. Most of them assume that the number of clusters, k, is known and fixed a priori by the user. Aimed at relaxing this assumption, which is often unrealistic in practical applications, we describe an algorithmic framework that allows estimating k automatically from data. We illustrate the potential of the proposed framework by using three state-of-the-art algorithms for clustering data streams - Stream LSearch, CluStream, and Stream KM++ - combined with two well-known algorithms for estimating the number of clusters, namely: Ordered Multiple Runs of k-Means (OMRk) and Bisecting k-Means (BkM). As an additional contribution, we experimentally compare the resulting algorithmic instantiations in both synthetic and real-world data streams. Analyses of statistical significance suggest that OMRk yields to the best data partitions, while BkM is more computationally efficient. Also, the combination of Stream KM++ with OMRk leads to the best trade-off between accuracy and efficiency.
ISBN:9781457721342
1457721341
DOI:10.1109/ICMLA.2011.67