A Cheap Feature Selection Approach for the K-Means Algorithm

Bibliographic Details
Published in: IEEE Transactions on Neural Networks and Learning Systems, Vol. 32, No. 5, pp. 2195-2208
Main Authors: Capó, Marco; Pérez, Aritz; Lozano, José A.
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2021
ISSN: 2162-237X
EISSN: 2162-2388
DOI: 10.1109/TNNLS.2020.3002576


More Information
Summary: The increase in the number of features that need to be analyzed in a wide variety of areas, such as genome sequencing, computer vision, or sensor networks, represents a challenge for the K-means algorithm. In this regard, different dimensionality reduction approaches for the K-means algorithm have been designed recently, leading to algorithms that have been shown to generate competitive clusterings. Unfortunately, most of these techniques have fairly high computational costs and/or are not easy to parallelize. In this article, we propose a fully parallelizable feature selection technique intended for the K-means algorithm. The proposal is based on a novel feature relevance measure that is closely related to the K-means error of a given clustering. Given a disjoint partition of the features, the technique consists of obtaining a clustering for each subset of features and selecting the m features with the highest relevance measure. The computational cost of this approach is just O(m · max{n · K, log m}) per subset of features. We additionally provide a theoretical analysis of the quality of the solution obtained via our proposal and empirically compare its performance with well-known feature selection and feature extraction techniques.
This analysis shows that our proposal consistently obtains clusterings with a lower K-means error than all the considered feature selection techniques (Laplacian scores, maximum variance, multicluster feature selection, and random selection), while requiring similar or lower computational times than these approaches. Moreover, compared with feature extraction techniques such as random projections, the proposed approach also shows a noticeable improvement in both error and computational time.
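The block-wise scheme described in the abstract (partition the features, cluster each block independently, score each feature, keep the top m) can be sketched as follows. This is a minimal illustration under loose assumptions, not the authors' implementation: the paper defines a specific relevance measure tied to the K-means error, whereas the stand-in below scores each feature by the fraction of its variance explained by the block's clustering (between-cluster over total variance). The function names and the toy Lloyd's solver are ours.

```python
import numpy as np

def kmeans(X, k, n_init=5, iters=50, seed=0):
    """Plain Lloyd's algorithm with random restarts; returns the labels
    of the run with the lowest within-cluster sum of squared errors."""
    rng = np.random.default_rng(seed)
    best_labels, best_sse = None, np.inf
    for _ in range(n_init):
        centers = X[rng.choice(len(X), size=k, replace=False)].copy()
        for _ in range(iters):
            d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
            labels = d.argmin(1)
            for j in range(k):
                if (labels == j).any():
                    centers[j] = X[labels == j].mean(0)
        sse = ((X - centers[labels]) ** 2).sum()
        if sse < best_sse:
            best_labels, best_sse = labels, sse
    return best_labels

def feature_relevance(X, labels):
    """Stand-in relevance: fraction of each feature's variance that the
    clustering explains (1 - within-cluster variance / total variance)."""
    total = X.var(axis=0) + 1e-12
    within = np.zeros(X.shape[1])
    for j in np.unique(labels):
        members = X[labels == j]
        within += len(members) * members.var(axis=0)
    within /= len(X)
    return 1.0 - within / total

def select_features(X, k, m, n_blocks, seed=0):
    """Split the features into disjoint blocks, cluster each block
    independently (each block could run in parallel), score every
    feature, and return the indices of the m highest-scoring ones."""
    rng = np.random.default_rng(seed)
    blocks = np.array_split(rng.permutation(X.shape[1]), n_blocks)
    scores = np.empty(X.shape[1])
    for block in blocks:
        labels = kmeans(X[:, block], k, seed=seed)
        scores[block] = feature_relevance(X[:, block], labels)
    return np.sort(np.argsort(scores)[-m:])
```

On synthetic data where only a few features carry cluster structure (e.g., two groups shifted apart in the first two coordinates and pure noise elsewhere), the shifted features receive markedly higher scores and are the ones selected; the per-block loop is embarrassingly parallel, which mirrors the "fully parallelizable" claim in the abstract.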