DASC: data aware algorithm for scalable clustering
| Published in | Knowledge and information systems Vol. 50; no. 3; pp. 851 - 881 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | London, Springer London, 01.03.2017, Springer Nature B.V |
| ISSN | 0219-1377 0219-3116 |
| DOI | 10.1007/s10115-016-0958-4 |
| Summary: | The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms provides for Volume, while the handling of Variety and Velocity needs to be skilfully crafted into the algorithms. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm, data aware scalable clustering (DASC), which is capable of handling the 3 Vs of big data by virtue of being (i) single scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an infinitely growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. The algorithm preserves an up-to-date synopsis of the data seen so far. Each new data increment is processed and merged with the synopsis. Since the synopsis itself may grow very large, the algorithm stores it as a file. This makes the DASC algorithm truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering. This problem is overcome by accomplishing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage utilizes this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes for discovering embedded clusters in parallel. The staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the MRC^0 class. The DASC algorithm is empirically compared with the incremental Kmeans and Scalable Kmeans++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution are in consonance with the theoretical analysis with respect to stability in resource utilization and execution time. |
|---|---|
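
The summary mentions two mechanisms that a small example can make concrete: merging each data increment into a running synopsis, and applying connected component analysis (CCA) over that synopsis to obtain exclusive clusters. The Python sketch below is illustrative only and is not the DASC implementation. The record does not describe the synopsis layout, so the sketch assumes a hypothetical dictionary of summary units with point counts and a hypothetical list of adjacency pairs between units. Treating the number of independent groups as the available degree of parallelism is likewise an assumption made for illustration.

```python
from collections import defaultdict


def merge_increment(synopsis, increment):
    """Fold per-unit counts from a new data increment into the synopsis.

    Both arguments are hypothetical {unit_id: point_count} dictionaries.
    """
    for unit, count in increment.items():
        synopsis[unit] = synopsis.get(unit, 0) + count
    return synopsis


def connected_components(nodes, edges):
    """Group nodes into connected components with union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())


# Assumed toy data: an existing synopsis, one new increment, and
# adjacency pairs between summary units.
synopsis = merge_increment({"a": 3, "b": 2}, {"b": 1, "c": 4, "d": 5})
edges = [("a", "b"), ("c", "d")]

groups = connected_components(synopsis.keys(), edges)
degree_of_parallelism = len(groups)  # assumption: one task per group
```

Each resulting group is independent of the others, which is why such groups could in principle be handed to separate compute nodes.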