Revisiting Fragmentation for Deduplication in Clustered Primary Storage Systems

Bibliographic Details
Published in Proceedings / IEEE International Conference on Cluster Computing, pp. 1-12
Main Authors Wang, Lin, Hu, Yuchong, Mao, Shilong, Li, Mingqi, Duan, Ziling, Huang, Yue, Qin, Leihua, Feng, Dan, Chen, Zehui, Dong, Ruliang
Format Conference Proceeding
Language English
Published IEEE 02.09.2025
ISSN 2168-9253
DOI 10.1109/CLUSTER59342.2025.11186466

More Information
Summary: To improve storage efficiency in large-scale clustered storage systems, deduplication, which removes duplicate chunks, has been widely deployed in distributed settings. Most distributed deduplication studies focus on backup storage, and some recent studies deploy deduplication in clustered primary storage systems, which store active data. While fragmentation is a traditional challenge in backup deduplication, we observe that a new fragmentation problem arises when performing deduplication in a clustered primary storage system due to the system's concurrent file writes. However, we find that existing state-of-the-art methods that address traditional fragmentation in backup deduplication fail to handle this new fragmentation problem effectively, as they incur significant additional redundancy or lower the deduplication ratio. In this paper, we revisit fragmentation-solving methods in memory management; our main idea is inspired by classic garbage collection: relocating fragments consecutively. Based on this idea, we propose ReoDedup, an effective deduplication mechanism for clustered primary storage systems, which applies: i) a cosine-similarity-based chunk relocating algorithm that aims to minimize fragmentation; ii) an adjacency-table-based relocating heuristic that reduces the relocation's time complexity by placing two chunks residing in the same file consecutively; and iii) an index-remapping update scheme that alleviates the extra fragmentation caused by updates. We implement ReoDedup atop Ceph, and our cloud experiments show that the average read throughput of ReoDedup can be increased by up to 1.72× over state-of-the-art methods, without any deduplication ratio loss.
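
The abstract does not spell out the relocation algorithm, but the following minimal Python sketch illustrates what a cosine-similarity-based chunk relocation pass could look like, assuming each chunk is characterized by a sparse vector of per-file reference counts. The names (chunk_refs, relocate, CONTAINER_CAP) and the greedy packing strategy are illustrative assumptions, not ReoDedup's actual implementation.

# Hypothetical sketch of a cosine-similarity-based chunk relocation pass.
# All names and the container capacity are illustrative assumptions,
# not ReoDedup's implementation.
import math
from collections import defaultdict

CONTAINER_CAP = 4  # chunks per container (toy value)

def cosine(u, v):
    """Cosine similarity between two sparse file-reference vectors."""
    dot = sum(u[k] * v.get(k, 0) for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def relocate(chunk_refs):
    """Greedily pack the chunks whose file-reference vectors are most similar
    into the same container, so a file's chunks end up (mostly) consecutive.
    chunk_refs: {chunk_id: {file_id: ref_count}}
    Returns a list of containers (lists of chunk ids)."""
    pending = dict(chunk_refs)
    containers = []
    while pending:
        seed_id, seed_vec = pending.popitem()
        container, profile = [seed_id], defaultdict(int, seed_vec)
        while len(container) < CONTAINER_CAP and pending:
            # pick the pending chunk most similar to the container's profile
            best = max(pending, key=lambda c: cosine(pending[c], profile))
            if cosine(pending[best], profile) == 0.0:
                break  # nothing related left; start a new container
            for f, n in pending.pop(best).items():
                profile[f] += n
            container.append(best)
        containers.append(container)
    return containers

if __name__ == "__main__":
    # chunks c1..c3 belong to file A, c4..c5 to file B
    refs = {"c1": {"A": 1}, "c4": {"B": 1}, "c2": {"A": 1},
            "c5": {"B": 1}, "c3": {"A": 1}}
    print(relocate(refs))

Running the toy example groups file A's chunks into one container and file B's into another. In this sketch the similarity search is quadratic in the number of pending chunks; the adjacency-table-based heuristic described in the abstract would serve to cut that cost by directly co-locating chunks already known to be adjacent within the same file.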