Data Clustering Algorithms and Applications
In this book, top researchers from around the world cover the entire area of clustering, from basic methods to more refined and complex data clustering approaches, paying special attention to recent issues in graphs, social networks, and other domains. The book explores the characteristics of clustering...
| Main Authors | , |
|---|---|
| Format | eBook; Book |
| Language | English |
| Published | Boca Raton : CRC Press (CRC Press LLC; Chapman & Hall), 2014 |
| Edition | 1 |
| Series | Chapman & Hall/CRC data mining and knowledge discovery series |
| Subjects | |
| Online Access | Get full text |
| ISBN | 9781466558212; 1466558210 |
| DOI | 10.1201/9781315373515 |
Table of Contents:
- Cover -- Half Title -- Title Page -- Copyright Page -- Table of Contents -- Preface -- Editor Biographies -- Contributors -- 1: An Introduction to Cluster Analysis -- 1.1 Introduction -- 1.2 Common Techniques Used in Cluster Analysis -- 1.2.1 Feature Selection Methods -- 1.2.2 Probabilistic and Generative Models -- 1.2.3 Distance-Based Algorithms -- 1.2.4 Density- and Grid-Based Methods -- 1.2.5 Leveraging Dimensionality Reduction Methods -- 1.2.5.1 Generative Models for Dimensionality Reduction -- 1.2.5.2 Matrix Factorization and Co-Clustering -- 1.2.5.3 Spectral Methods -- 1.2.6 The High Dimensional Scenario -- 1.2.7 Scalable Techniques for Cluster Analysis -- 1.2.7.1 I/O Issues in Database Management -- 1.2.7.2 Streaming Algorithms -- 1.2.7.3 The Big Data Framework -- 1.3 Data Types Studied in Cluster Analysis -- 1.3.1 Clustering Categorical Data -- 1.3.2 Clustering Text Data -- 1.3.3 Clustering Multimedia Data -- 1.3.4 Clustering Time-Series Data -- 1.3.5 Clustering Discrete Sequences -- 1.3.6 Clustering Network Data -- 1.3.7 Clustering Uncertain Data -- 1.4 Insights Gained from Different Variations of Cluster Analysis -- 1.4.1 Visual Insights -- 1.4.2 Supervised Insights -- 1.4.3 Multiview and Ensemble-Based Insights -- 1.4.4 Validation-Based Insights -- 1.5 Discussion and Conclusions -- 2: Feature Selection for Clustering: A Review -- 2.1 Introduction -- 2.1.1 Data Clustering -- 2.1.2 Feature Selection -- 2.1.3 Feature Selection for Clustering -- 2.1.3.1 Filter Model -- 2.1.3.2 Wrapper Model -- 2.1.3.3 Hybrid Model -- 2.2 Feature Selection for Clustering -- 2.2.1 Algorithms for Generic Data -- 2.2.1.1 Spectral Feature Selection (SPEC) -- 2.2.1.2 Laplacian Score (LS) -- 2.2.1.3 Feature Selection for Sparse Clustering -- 2.2.1.4 Localized Feature Selection Based on Scatter Separability (LFSBSS) -- 2.2.1.5 Multicluster Feature Selection (MCFS)
- 2.2.1.6 Feature Weighting k-Means -- 2.2.2 Algorithms for Text Data -- 2.2.2.1 Term Frequency (TF) -- 2.2.2.2 Inverse Document Frequency (IDF) -- 2.2.2.3 Term Frequency-Inverse Document Frequency (TF-IDF) -- 2.2.2.4 Chi Square Statistic -- 2.2.2.5 Frequent Term-Based Text Clustering -- 2.2.2.6 Frequent Term Sequence -- 2.2.3 Algorithms for Streaming Data -- 2.2.3.1 Text Stream Clustering Based on Adaptive Feature Selection (TSC-AFS) -- 2.2.3.2 High-Dimensional Projected Stream Clustering (HPStream) -- 2.2.4 Algorithms for Linked Data -- 2.2.4.1 Challenges and Opportunities -- 2.2.4.2 LUFS: An Unsupervised Feature Selection Framework for Linked Data -- 2.2.4.3 Conclusion and Future Work for Linked Data -- 2.3 Discussions and Challenges -- 2.3.1 The Chicken or the Egg Dilemma -- 2.3.2 Model Selection: K and l -- 2.3.3 Scalability -- 2.3.4 Stability -- 3: Probabilistic Models for Clustering -- 3.1 Introduction -- 3.2 Mixture Models -- 3.2.1 Overview -- 3.2.2 Gaussian Mixture Model -- 3.2.3 Bernoulli Mixture Model -- 3.2.4 Model Selection Criteria -- 3.3 EM Algorithm and Its Variations -- 3.3.1 The General EM Algorithm -- 3.3.2 Mixture Models Revisited -- 3.3.3 Limitations of the EM Algorithm -- 3.3.4 Applications of the EM Algorithm -- 3.4 Probabilistic Topic Models -- 3.4.1 Probabilistic Latent Semantic Analysis -- 3.4.2 Latent Dirichlet Allocation -- 3.4.3 Variations and Extensions -- 3.5 Conclusions and Summary -- 4: A Survey of Partitional and Hierarchical Clustering Algorithms -- 4.1 Introduction -- 4.2 Partitional Clustering Algorithms -- 4.2.1 K-Means Clustering -- 4.2.2 Minimization of Sum of Squared Errors -- 4.2.3 Factors Affecting K-Means -- 4.2.3.1 Popular Initialization Methods -- 4.2.3.2 Estimating the Number of Clusters -- 4.2.4 Variations of K-Means -- 4.2.4.1 K-Medoids Clustering -- 4.2.4.2 K-Medians Clustering
- 4.2.4.3 K-Modes Clustering -- 4.2.4.4 Fuzzy K-Means Clustering -- 4.2.4.5 X-Means Clustering -- 4.2.4.6 Intelligent K-Means Clustering -- 4.2.4.7 Bisecting K-Means Clustering -- 4.2.4.8 Kernel K-Means Clustering -- 4.2.4.9 Mean Shift Clustering -- 4.2.4.10 Weighted K-Means Clustering -- 4.2.4.11 Genetic K-Means Clustering -- 4.2.5 Making K-Means Faster -- 4.3 Hierarchical Clustering Algorithms -- 4.3.1 Agglomerative Clustering -- 4.3.1.1 Single and Complete Link -- 4.3.1.2 Group Averaged and Centroid Agglomerative Clustering -- 4.3.1.3 Ward's Criterion -- 4.3.1.4 Agglomerative Hierarchical Clustering Algorithm -- 4.3.1.5 Lance-Williams Dissimilarity Update Formula -- 4.3.2 Divisive Clustering -- 4.3.2.1 Issues in Divisive Clustering -- 4.3.2.2 Divisive Hierarchical Clustering Algorithm -- 4.3.2.3 Minimum Spanning Tree-Based Clustering -- 4.3.3 Other Hierarchical Clustering Algorithms -- 4.4 Discussion and Summary -- 5: Density-Based Clustering -- 5.1 Introduction -- 5.2 DBSCAN -- 5.3 DENCLUE -- 5.4 OPTICS -- 5.5 Other Algorithms -- 5.6 Subspace Clustering -- 5.7 Clustering Networks -- 5.8 Other Directions -- 5.9 Conclusion -- 6: Grid-Based Clustering -- 6.1 Introduction -- 6.2 The Classical Algorithms -- 6.2.1 Earliest Approaches: GRIDCLUS and BANG -- 6.2.2 STING and STING+: The Statistical Information Grid Approach -- 6.2.3 WaveCluster: Wavelets in Grid-Based Clustering -- 6.3 Adaptive Grid-Based Algorithms -- 6.3.1 AMR: Adaptive Mesh Refinement Clustering -- 6.4 Axis-Shifting Grid-Based Algorithms -- 6.4.1 NSGC: New Shifting Grid Clustering Algorithm -- 6.4.2 ADCC: Adaptable Deflect and Conquer Clustering -- 6.4.3 ASGC: Axis-Shifted Grid-Clustering -- 6.4.4 GDILC: Grid-Based Density-IsoLine Clustering Algorithm -- 6.5 High-Dimensional Algorithms -- 6.5.1 CLIQUE: The Classical High-Dimensional Algorithm -- 6.5.2 Variants of CLIQUE
- 6.5.2.1 ENCLUS: Entropy-Based Approach -- 6.5.2.2 MAFIA: Adaptive Grids in High Dimensions -- 6.5.3 OptiGrid: Density-Based Optimal Grid Partitioning -- 6.5.4 Variants of the OptiGrid Approach -- 6.5.4.1 O-Cluster: A Scalable Approach -- 6.5.4.2 CBF: Cell-Based Filtering -- 6.6 Conclusions and Summary -- 7: Nonnegative Matrix Factorizations for Clustering: A Survey -- 7.1 Introduction -- 7.1.1 Background -- 7.1.2 NMF Formulations -- 7.2 NMF for Clustering: Theoretical Foundations -- 7.2.1 NMF and K-Means Clustering -- 7.2.2 NMF and Probabilistic Latent Semantic Indexing -- 7.2.3 NMF and Kernel K-Means and Spectral Clustering -- 7.2.4 NMF Boundedness Theorem -- 7.3 NMF Clustering Capabilities -- 7.3.1 Examples -- 7.3.2 Analysis -- 7.4 NMF Algorithms -- 7.4.1 Introduction -- 7.4.2 Algorithm Development -- 7.4.3 Practical Issues in NMF Algorithms -- 7.4.3.1 Initialization -- 7.4.3.2 Stopping Criteria -- 7.4.3.3 Objective Function vs. Clustering Performance -- 7.4.3.4 Scalability -- 7.5 NMF Related Factorizations -- 7.6 NMF for Clustering: Extensions -- 7.6.1 Co-Clustering -- 7.6.2 Semisupervised Clustering -- 7.6.3 Semisupervised Co-Clustering -- 7.6.4 Consensus Clustering -- 7.6.5 Graph Clustering -- 7.6.6 Other Clustering Extensions -- 7.7 Conclusions -- 8: Spectral Clustering -- 8.1 Introduction -- 8.2 Similarity Graph -- 8.3 Unnormalized Spectral Clustering -- 8.3.1 Notation -- 8.3.2 Unnormalized Graph Laplacian -- 8.3.3 Spectrum Analysis -- 8.3.4 Unnormalized Spectral Clustering Algorithm -- 8.4 Normalized Spectral Clustering -- 8.4.1 Normalized Graph Laplacian -- 8.4.2 Spectrum Analysis -- 8.4.3 Normalized Spectral Clustering Algorithm -- 8.5 Graph Cut View -- 8.5.1 Ratio Cut Relaxation -- 8.5.2 Normalized Cut Relaxation -- 8.6 Random Walks View -- 8.7 Connection to Laplacian Eigenmap
- 8.8 Connection to Kernel k-Means and Nonnegative Matrix Factorization -- 8.9 Large Scale Spectral Clustering -- 8.10 Further Reading -- 9: Clustering High-Dimensional Data -- 9.1 Introduction -- 9.2 The "Curse of Dimensionality" -- 9.2.1 Different Aspects of the "Curse" -- 9.2.2 Consequences -- 9.3 Clustering Tasks in Subspaces of High-Dimensional Data -- 9.3.1 Categories of Subspaces -- 9.3.1.1 Axis-Parallel Subspaces -- 9.3.1.2 Arbitrarily Oriented Subspaces -- 9.3.1.3 Special Cases -- 9.3.2 Search Spaces for the Clustering Problem -- 9.4 Fundamental Algorithmic Ideas -- 9.4.1 Clustering in Axis-Parallel Subspaces -- 9.4.1.1 Cluster Model -- 9.4.1.2 Basic Techniques -- 9.4.1.3 Clustering Algorithms -- 9.4.2 Clustering in Arbitrarily Oriented Subspaces -- 9.4.2.1 Cluster Model -- 9.4.2.2 Basic Techniques and Example Algorithms -- 9.5 Open Questions and Current Research Directions -- 9.6 Conclusion -- 10: A Survey of Stream Clustering Algorithms -- 10.1 Introduction -- 10.2 Methods Based on Partitioning Representatives -- 10.2.1 The STREAM Algorithm -- 10.2.2 CluStream: The Microclustering Framework -- 10.2.2.1 Microcluster Definition -- 10.2.2.2 Pyramidal Time Frame -- 10.2.2.3 Online Clustering with CluStream -- 10.3 Density-Based Stream Clustering -- 10.3.1 DenStream: Density-Based Microclustering -- 10.3.2 Grid-Based Streaming Algorithms -- 10.3.2.1 D-Stream Algorithm -- 10.3.2.2 Other Grid-Based Algorithms -- 10.4 Probabilistic Streaming Algorithms -- 10.5 Clustering High-Dimensional Streams -- 10.5.1 The HPSTREAM Method -- 10.5.2 Other High-Dimensional Streaming Algorithms -- 10.6 Clustering Discrete and Categorical Streams -- 10.6.1 Clustering Binary Data Streams with k-Means -- 10.6.2 The StreamCluCD Algorithm -- 10.6.3 Massive-Domain Clustering -- 10.7 Text Stream Clustering -- 10.8 Other Scenarios for Stream Clustering -- 10.8.1 Clustering Uncertain Data Streams