Multi-Fault Tolerance for Cartesian Data Distributions

Bibliographic Details
Published in	International Journal of Parallel Programming, Vol. 41, no. 3, pp. 469-493
Main Authors	Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh; Daily, Jeff
Format	Journal Article
Language	English
Published	Boston: Springer US, 01.06.2013 (Springer Nature B.V.)
Subjects	Algorithms; Blocking; Checksums; Computer Science; Data distribution; Failure; Fault tolerance; Fault tolerant linear algebra; Faults; Handles; Linear algebra; Parallel processing; Parity; Processor Architectures; Processors; Software Engineering/Programming and Operating Systems; Sparsity; Studies; Theory of Computation
ISSN	0885-7458 1573-7640
DOI	10.1007/s10766-012-0218-5

Summary:	Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
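
The idea of storing parities along the dimensions of a matrix can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes a regular grid of equal-sized blocks, integer data, and a simple element-wise sum as the parity, and it does not model generalized Cartesian distributions, correlated failures, or graph-matching-based parity placement. All names (`checksums`, `recover`, `Grid`) are hypothetical.

```python
from typing import List, Optional, Tuple

Block = List[int]                    # one data block, flattened
Grid = List[List[Optional[Block]]]   # 2D grid of blocks; None marks a lost block


def add(a: Block, b: Block) -> Block:
    return [x + y for x, y in zip(a, b)]


def sub(a: Block, b: Block) -> Block:
    return [x - y for x, y in zip(a, b)]


def checksums(grid: Grid) -> Tuple[List[Block], List[Block]]:
    """Return (row_sums, col_sums): element-wise sums along each dimension.

    Assumes every block is present and all blocks have the same length.
    """
    size = len(grid[0][0])
    row_sums = []
    for row in grid:
        total = [0] * size
        for blk in row:
            total = add(total, blk)
        row_sums.append(total)
    col_sums = []
    for j in range(len(grid[0])):
        total = [0] * size
        for row in grid:
            total = add(total, row[j])
        col_sums.append(total)
    return row_sums, col_sums


def recover(grid: Grid, row_sums: List[Block], col_sums: List[Block]) -> bool:
    """Rebuild lost blocks in place; return True if every block was restored.

    A block is rebuilt when it is the only missing block in its row or in
    its column: parity minus the surviving blocks gives the lost block.
    The scan repeats because one repair can make another one possible.
    """
    progress = True
    while progress:
        progress = False
        for i, row in enumerate(grid):
            for j, blk in enumerate(row):
                if blk is not None:
                    continue
                row_others = [b for k, b in enumerate(row) if k != j]
                col_others = [grid[k][j] for k in range(len(grid)) if k != i]
                if all(b is not None for b in row_others):
                    total, others = row_sums[i], row_others
                elif all(b is not None for b in col_others):
                    total, others = col_sums[j], col_others
                else:
                    continue
                rebuilt = list(total)
                for b in others:
                    rebuilt = sub(rebuilt, b)
                grid[i][j] = rebuilt
                progress = True
    return all(b is not None for row in grid for b in row)
```

With parities along both dimensions, two blocks lost from the same row can still be rebuilt through their columns, which is how this kind of scheme tolerates multiple simultaneous faults. A failure pattern in which every affected row and column has at least two missing blocks is beyond what this simple sum parity can repair.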