DIDES: a fast and effective sampling for clustering algorithm

Bibliographic Details
Published in	Knowledge and Information Systems, Vol. 50, no. 2, pp. 543-568
Main Authors	Ros, Frédéric; Guillaume, Serge
Format	Journal Article
Language	English
Published	London: Springer London; Springer Nature B.V.; Springer, 01.02.2017
Subjects	Algorithms; Analysis; Bias; Cluster analysis; Clustering; Computer Science; Data mining; Data Mining and Knowledge Discovery; Database Management; Datasets; Density; Distance; Environmental Sciences; Information Storage and Retrieval; Information systems; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Mathematical analysis; Mathematics; Optimization; Pattern recognition; Rand index; Regular Paper; Sample size; Sampling; Space coverage; Studies
ISSN	0219-1377, 0219-3116
DOI	10.1007/s10115-016-0946-8

Summary:	As clustering algorithms grow more sophisticated to cope with current needs, namely large data sets of increasing complexity, sampling is likely to provide an interesting alternative. The proposal is a distance-based algorithm that iteratively adds to the sample the item furthest from all those already selected. Density is handled in a postprocessing step, in which either low- or high-density areas are considered. The algorithm has several useful properties: it is insensitive to initialization, data size and noise; it is accurate according to the Rand index; and it avoids many distance calculations thanks to internal optimization. Moreover, it is driven by a single meaningful parameter, called granularity, which affects the sample size. Compared with competing approaches, it proved as powerful as the best known methods, with the lowest CPU cost.
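
The distance-based selection step described in the summary can be illustrated with a generic farthest-first traversal. This is only a minimal sketch and not the authors' implementation: it reads "furthest from all the already selected ones" as maximizing the distance to the nearest selected item (an assumption), it omits the density postprocessing and the internal distance-saving optimization, and it replaces the granularity parameter with a plain `sample_size` cap. The names `farthest_first_sample`, `sample_size` and `start` are hypothetical.

```python
import math


def farthest_first_sample(points, sample_size, start=0):
    """Return indices of a sample built by farthest-first traversal.

    Illustrative only: `sample_size` is a stand-in stopping rule,
    not the granularity parameter mentioned in the summary.
    """
    if not points or sample_size <= 0:
        return []
    selected = [start]
    # nearest[i] holds the distance from points[i] to its closest selected item.
    nearest = [math.dist(p, points[start]) for p in points]
    while len(selected) < min(sample_size, len(points)):
        # Pick the item whose closest selected item is farthest away.
        candidate = max(range(len(points)), key=nearest.__getitem__)
        if nearest[candidate] == 0.0:
            break  # every remaining item coincides with a selected one
        selected.append(candidate)
        # Only distances to the newly selected item need to be computed.
        for i, p in enumerate(points):
            d = math.dist(p, points[candidate])
            if d < nearest[i]:
                nearest[i] = d
    return selected


# Two tight pairs of points plus one isolated point.
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (0.0, 9.0)]
sample = farthest_first_sample(points, sample_size=3)
print(sample)  # [0, 4, 3]
```

On this toy input the sample takes one item from each of the three regions. Because the isolated point is also picked early, the example shows why a purely distance-based selection needs a separate density step.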