Small Files Problem Resolution via Hierarchical Clustering Algorithm


Bibliographic Details
Published in: Big Data, Vol. 12, no. 3, p. 229
Main Authors: Koren, Oded; Shamalov, Aviel; Perel, Nir
Format: Journal Article
Language: English
Published: United States, 01.06.2024
ISSN: 2167-647X
DOI: 10.1089/big.2022.0181

Summary: The Small Files Problem in the Hadoop Distributed File System (HDFS) is an ongoing challenge that has not yet been solved, although various approaches have been developed to tackle the obstacles it creates. Properly managing block sizes in a file system is essential, as it saves memory and computing time and may reduce bottlenecks. In this article, a new approach based on a Hierarchical Clustering Algorithm is proposed for dealing with small files. The method identifies files by their structure and, via a special Dendrogram analysis, recommends which files can be merged. As a simulation, the proposed algorithm was applied to 100 CSV files with different structures, containing 2-4 columns of different data types (integer, decimal, and text). In addition, 20 non-CSV files were created to demonstrate that the algorithm operates only on CSV files. All data were analyzed with a machine learning hierarchical clustering method, and a Dendrogram was created. Based on the Dendrogram analysis, seven files were identified as appropriate candidates and merged, which reduced the memory space used in the HDFS. Furthermore, the results showed that the suggested algorithm led to efficient file management.
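The merge-recommendation idea described in the summary can be sketched as follows: represent each CSV file by a structural feature vector, build a dendrogram with agglomerative (hierarchical) clustering, and cut it so that only structurally identical files land in the same cluster. This is an illustrative sketch using SciPy, not the authors' exact algorithm; the file names and the feature encoding (column count plus per-type column counts) are hypothetical.

```python
# Sketch: hierarchical clustering of CSV file structures to find merge candidates.
# Assumption: each file is summarized as (total columns, #int, #decimal, #text columns).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

files = {
    "a.csv": (2, 1, 0, 1),
    "b.csv": (2, 1, 0, 1),  # identical structure to a.csv -> merge candidate
    "c.csv": (3, 1, 1, 1),
    "d.csv": (4, 2, 1, 1),
    "e.csv": (3, 1, 1, 1),  # identical structure to c.csv -> merge candidate
}

names = list(files)
X = np.array([files[n] for n in names], dtype=float)

# Build the dendrogram (Ward linkage) and cut it at distance 0, so only
# files with exactly the same structure fall into the same cluster.
Z = linkage(X, method="ward")
labels = fcluster(Z, t=0.0, criterion="distance")

# Group file names by cluster label; clusters of size > 1 are merge candidates.
groups = {}
for name, label in zip(names, labels):
    groups.setdefault(label, []).append(name)

merge_candidates = [g for g in groups.values() if len(g) > 1]
print(merge_candidates)
```

In a real pipeline the cut threshold could be raised above zero to also merge files with similar (not strictly identical) structures, at the cost of needing schema reconciliation during the merge.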