Research on Parallel Adaptive Canopy-K-Means Clustering Algorithm for Big Data Mining Based on Cloud Platform

Bibliographic Details
Published in	Journal of Grid Computing, Vol. 18, no. 2, pp. 263-273
Main Authors	Xia, Dongliang; Ning, Feifei; He, Weina
Format	Journal Article
Language	English
Published	Dordrecht: Springer Netherlands, 01.06.2020; Springer Nature B.V.
Subjects	Adaptive algorithms; Algorithms; Canopies; Cloud computing; Cluster analysis; Clustering; Computer Science; Data mining; Datasets; Management of Computing and Information Systems; Mathematical models; Parameter estimation; Processor Architectures; User Interfaces and Human Computer Interaction; Vector quantization; Big data mining; K-means clustering algorithm; Canopy algorithm; Parallel framework; Cloud platform; Spark framework
ISSN	1570-7873, 1572-9184
DOI	10.1007/s10723-019-09504-z

Summary:	This paper first surveys the types of clustering algorithms and describes the classical K-means and Canopy algorithms in detail. It then combines the MapReduce computing model with the Spark cloud computing framework to present a parallel Canopy-K-means algorithm, in which Canopy is used to optimize the initial values of K-means. However, Canopy introduces a new distance threshold parameter, T2, that must be set from human experience, which is difficult to do by hand for large data sets. The paper therefore proposes a parallel adaptive Canopy-K-means algorithm that runs in a cloud computing framework and determines T2 adaptively using a statistical method. Exploiting the parallelism of the MapReduce model, the parallel Canopy-K-means algorithm is optimized through adaptive parameter estimation, which removes the dependence on manually selected parameters in the Canopy step. After presenting the relevant theory and the derivation of the algorithm, the paper builds a cloud computing experiment platform on the Spark framework and runs comparison experiments on data from the Stanford Large Network Dataset Collection (SNAP) and a self-built Dimension Networks dataset. The experimental results show that the proposed method is effective.
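
The idea in the summary (a Canopy pass that seeds K-means, with the threshold T2 estimated from the data rather than set by hand) can be illustrated with a small single-machine sketch. This is not the authors' algorithm: the summary does not give the statistical estimator, so `estimate_t2` below uses the mean pairwise distance of a sample purely as an assumed stand-in, and all function names are hypothetical. In a MapReduce or Spark setting, the assignment loop in `kmeans` would correspond to the map step and the center recomputation to the reduce step.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def estimate_t2(points, sample_size=200, seed=0):
    """ASSUMED heuristic (not from the paper): T2 is the mean pairwise
    distance over a random sample. Requires at least two points."""
    rng = random.Random(seed)
    sample = points if len(points) <= sample_size else rng.sample(points, sample_size)
    pairs = [(i, j) for i in range(len(sample)) for j in range(i + 1, len(sample))]
    return sum(dist(sample[i], sample[j]) for i, j in pairs) / len(pairs)


def canopy_centers(points, t2):
    """Canopy pass: take the first remaining point as a center, discard
    every point within T2 of it, and repeat until no points remain."""
    remaining = list(points)
    centers = []
    while remaining:
        center = remaining[0]
        centers.append(center)
        remaining = [p for p in remaining[1:] if dist(p, center) > t2]
    return centers


def kmeans(points, centers, max_iter=100):
    """Lloyd's K-means started from the given centers (K = len(centers))."""
    centers = [tuple(c) for c in centers]
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
            clusters[idx].append(p)
        # Update step: each center moves to the mean of its cluster;
        # an empty cluster keeps its previous center.
        new_centers = [
            tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else c
            for cl, c in zip(clusters, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    t2 = estimate_t2(data)
    seeds = canopy_centers(data, t2)
    centers, clusters = kmeans(data, seeds)
    print("T2:", t2)
    print("canopy seeds:", seeds)
    print("final centers:", centers)
```

Because the Canopy pass fixes both the number of clusters and their starting positions, the only tunable input left is T2, which is why an automatic estimate of that threshold matters for large data sets.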