Single Disk Failure Recovery for X-Code-Based Parallel Storage Systems

Bibliographic Details
Published in: IEEE Transactions on Computers Vol. 63; no. 4; pp. 995-1007
Main Authors: Silei Xu, Runhui Li, Patrick P. C. Lee, Yunfeng Zhu, Liping Xiang, Yinlong Xu, John C. S. Lui
Format: Journal Article
Language: English
Published: IEEE 01.04.2014
ISSN: 0018-9340
DOI: 10.1109/TC.2013.8

Summary: In modern parallel storage systems (e.g., cloud storage and data centers), it is important to provide data availability guarantees against disk (or storage node) failures via redundancy coding schemes. One coding scheme is X-code, which is double-fault tolerant while achieving the optimal update complexity. When a disk/node fails, recovery must be carried out to reduce the possibility of data unavailability. We propose an X-code-based optimal recovery scheme called minimum-disk-read-recovery (MDRR), which minimizes the number of disk reads for single-disk failure recovery. We make several contributions. First, we show that MDRR provides optimal single-disk failure recovery and reduces disk reads by about 25 percent compared to the conventional recovery approach. Second, we prove that, in general, no optimal recovery scheme for X-code can balance disk reads among different disks within a single stripe. Third, we propose an efficient logical encoding scheme that issues balanced disk reads across a group of stripes for any recovery algorithm (including the MDRR scheme). Finally, we implement our proposed recovery schemes and conduct extensive testbed experiments in a networked storage system prototype. Experiments indicate that MDRR reduces recovery time by around 20 percent relative to the conventional approach, showing that our theoretical findings are applicable in practice.
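The idea of reducing disk reads by mixing parity chains can be illustrated with a small sketch. It is not the MDRR algorithm from the article, whose details this record does not give. It assumes the usual X-code layout for a prime p (a p x p stripe, one column per disk, rows 0 to p-3 holding data, row p-2 diagonal parity, row p-1 anti-diagonal parity) and takes "conventional" to mean rebuilding every lost data symbol through its diagonal chain. The function names are hypothetical, and the minimum is found by brute force, which is only practical for small p.

```python
from itertools import product

def repair_options(p, f):
    """Map each lost cell (row, col) in failed column f to its repair chains.

    Assumed layout: diagonal parity (p-2, i) covers cells (k, (i+k+2) % p) and
    anti-diagonal parity (p-1, i) covers cells (k, (i-k-2) % p), k = 0..p-3.
    Each option is the set of surviving cells that must be read.
    """
    options = {}
    for i in range(p):
        chains = {
            "diag": [(k, (i + k + 2) % p) for k in range(p - 2)] + [(p - 2, i)],
            "anti": [(k, (i - k - 2) % p) for k in range(p - 2)] + [(p - 1, i)],
        }
        for kind, cells in chains.items():
            for cell in cells:
                if cell[1] == f:  # a chain touches each column at most once
                    options.setdefault(cell, {})[kind] = frozenset(cells) - {cell}
    return options

def reads_for(options, prefer):
    """Count distinct cells read; parity cells use their only available chain."""
    reads = set()
    for cell, by_kind in options.items():
        kind = prefer[cell] if len(by_kind) == 2 else next(iter(by_kind))
        reads |= by_kind[kind]
    return len(reads)

def conventional_and_minimum(p, f):
    """Return (all-diagonal read count, brute-force minimum read count)."""
    options = repair_options(p, f)
    data_cells = sorted(c for c, by_kind in options.items() if len(by_kind) == 2)
    conventional = reads_for(options, {c: "diag" for c in data_cells})
    minimum = min(
        reads_for(options, dict(zip(data_cells, picks)))
        for picks in product(("diag", "anti"), repeat=len(data_cells))
    )
    return conventional, minimum

# Compare the two strategies for a few small primes, failing disk 0.
for p in (5, 7, 11):
    print(p, conventional_and_minimum(p, 0))
```

The saving comes from symbols shared between chains: a symbol needed by two chains is read once, so choosing chains that overlap as much as possible lowers the total. The search is exponential in p, so this sketch is for building intuition only.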