RAN: Accelerating Data Repair with Available Nodes in Erasure-Coded Storage

Bibliographic Details
Published in	Proceedings / IEEE International Conference on Cluster Computing, pp. 1-12
Main Authors	Yang, Canghai; Zhong, Kan; Tan, Yujuan; Ren, Ao; Liu, Duo
Format	Conference Proceeding
Language	English
Published	IEEE 02.09.2025
Subjects	Distributed Storage; Downlink; Encoding; Erasure Coding; Fault tolerance; Fault tolerant systems; Maintenance engineering; Network Transfer; Performance evaluation; Prototypes; Repair Algorithms; Systematics; Throughput; Uplink
ISSN	2168-9253
DOI	10.1109/CLUSTER59342.2025.11186483

Summary:	Distributed storage systems ensure data availability through fault-tolerant mechanisms, with erasure coding widely adopted for its low storage overhead. However, erasure coding generates significant repair traffic during data recovery, severely degrading performance. Recent repair algorithms aim to alleviate network bottlenecks at congested nodes, but they primarily address downlink bottlenecks while neglecting uplink constraints, which fundamentally limit repair efficiency. Furthermore, these algorithms lack a systematic approach for handling diverse failure scenarios, complicating recovery implementation. In this paper, we propose RAN, an aggregation-based repair algorithm that alleviates both uplink and downlink bottlenecks by optimizing bandwidth utilization across all available nodes and aggregating network transfers via programmable network devices. Additionally, RAN systematically maximizes repair performance across diverse failure scenarios through a unified procedure. Experiments on Amazon EC2 show that RAN improves repair throughput by up to 68.9% for degraded read and 266.6% for full-node recovery compared to state-of-the-art algorithms.