RAN: Accelerating Data Repair with Available Nodes in Erasure-Coded Storage

Bibliographic Details
Published in Proceedings / IEEE International Conference on Cluster Computing, pp. 1-12
Main Authors Yang, Canghai; Zhong, Kan; Tan, Yujuan; Ren, Ao; Liu, Duo
Format Conference Proceeding
Language English
Published IEEE 02.09.2025
ISSN 2168-9253
DOI 10.1109/CLUSTER59342.2025.11186483

Summary: Distributed storage systems ensure data availability through fault-tolerant mechanisms, with erasure coding widely adopted for its low storage overhead. However, erasure coding generates significant repair traffic during data recovery, which severely degrades performance. Recent repair algorithms aim to alleviate network bottlenecks at congested nodes, but they primarily address downlink bottlenecks while neglecting uplink constraints, which fundamentally limit repair efficiency. Furthermore, these algorithms lack a systematic approach for handling diverse failure scenarios, which complicates recovery implementation. In this paper, we propose RAN, an aggregation-based repair algorithm that alleviates both uplink and downlink bottlenecks by optimizing bandwidth utilization across all available nodes and aggregating network transfers via programmable network devices. Additionally, RAN systematically maximizes repair performance across diverse failure scenarios through a unified procedure. Experiments on Amazon EC2 show that RAN improves repair throughput by up to 68.9% for degraded read and 266.6% for full-node recovery compared to state-of-the-art algorithms.