DASC: data aware algorithm for scalable clustering

Bibliographic Details
Published in	Knowledge and information systems Vol. 50; no. 3; pp. 851 - 881
Main Authors	Bhatnagar, Vasudha; Kaur, Sharanjit; Saxena, Rakhi; Khanna, Dhriti
Format	Journal Article
Language	English
Published	London, Springer London, 01.03.2017; Springer Nature B.V.
Subjects	Adaptation; Algorithms; Analysis; Artificial intelligence; Batch processing; Big Data; Clustering; Computer Science; Data mining; Data Mining and Knowledge Discovery; Data points; Database Management; Datasets; Handling; Information Storage and Retrieval; Information systems; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Real time; Regular Paper; Scalability; Scaling; Stability analysis; Stores; Studies; Velocity; Incremental clustering; Complexity class; Scalable clustering; Synopsis; MapReduce
ISSN	0219-1377 0219-3116
DOI	10.1007/s10115-016-0958-4

Summary:	The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms provides for Volume, while handling of Variety and Velocity needs to be skilfully crafted into the algorithms. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm, data aware scalable clustering (DASC), which is capable of handling the 3 Vs of big data by virtue of being (i) single-scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an infinitely growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. The algorithm preserves an up-to-date synopsis of the data seen so far. Each new data increment is processed and merged with the synopsis. Since the synopsis itself may grow very large, the algorithm stores it as a file. This makes the DASC algorithm truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering. This problem is overcome by accomplishing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage utilizes this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes for discovering embedded clusters in parallel. The staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the MRC^0 class. The DASC algorithm is empirically compared with the incremental K-means and Scalable K-means++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution are in consonance with the theoretical analysis with respect to stability in resource utilization and execution time.
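
The summary's core loop (merge each data increment into a synopsis, then obtain exclusive clusters on demand through connected component analysis) can be illustrated with a small single-machine sketch. This is not the authors' MapReduce implementation: the grid-cell synopsis, the adjacency rule, and the names `merge_increment`, `connected_components`, `cell_width` and `min_count` are all assumptions made for illustration, and only numeric data is handled.

```python
from collections import Counter
from itertools import product


def merge_increment(synopsis, increment, cell_width=1.0):
    """Fold a batch of numeric points into the synopsis.

    Assumption: the synopsis maps a grid cell (tuple of ints) to the
    number of points seen in that cell. The paper's synopsis may differ.
    """
    for point in increment:
        cell = tuple(int(x // cell_width) for x in point)
        synopsis[cell] += 1
    return synopsis


def connected_components(synopsis, min_count=1):
    """Group adjacent non-sparse cells into exclusive clusters (union-find)."""
    dense = {cell for cell, n in synopsis.items() if n >= min_count}
    parent = {cell: cell for cell in dense}

    def find(cell):
        # Walk to the root, halving the path as we go.
        while parent[cell] != cell:
            parent[cell] = parent[parent[cell]]
            cell = parent[cell]
        return cell

    for cell in dense:
        # Assumed adjacency: cells differing by at most 1 in every dimension.
        for offset in product((-1, 0, 1), repeat=len(cell)):
            neighbour = tuple(a + b for a, b in zip(cell, offset))
            if neighbour in dense:
                root_a, root_b = find(cell), find(neighbour)
                if root_a != root_b:
                    parent[root_a] = root_b

    clusters = {}
    for cell in dense:
        clusters.setdefault(find(cell), set()).add(cell)
    return list(clusters.values())


synopsis = Counter()
merge_increment(synopsis, [(0.2, 0.3), (1.1, 0.9), (5.5, 5.5)])  # first increment
merge_increment(synopsis, [(5.9, 6.2)])                          # later increment
clusters = connected_components(synopsis)
```

Because only the synopsis is kept between increments, each batch is scanned once; clustering runs over the cells rather than the raw points. The two-stage hypercluster partitioning described in the summary is not modelled here.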