Multi-Fault Tolerance for Cartesian Data Distributions

Bibliographic Details
Published in International Journal of Parallel Programming, Vol. 41, no. 3, pp. 469-493
Main Authors Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh; Daily, Jeff
Format Journal Article
Language English
Published Boston, Springer US, 01.06.2013
Springer Nature B.V
ISSN 0885-7458
1573-7640
DOI 10.1007/s10766-012-0218-5


Abstract Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead.
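The idea of storing parities along the dimensions of a matrix can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each block is a single integer, uses plain sums as the parity, and the names `add_checksums` and `recover` are hypothetical. A lost block (marked `None`) is rebuilt from any row or column in which it is the only missing entry, so two simultaneous faults in the same row are still recoverable through their columns.

```python
# Hypothetical sketch of row/column checksum recovery on a grid of blocks.
# Each "block" is one integer here; real blocks would be sub-matrices.

def add_checksums(grid):
    """Return (row_sums, col_sums) for a rectangular grid of numeric blocks."""
    row_sums = [sum(row) for row in grid]
    col_sums = [sum(col) for col in zip(*grid)]
    return row_sums, col_sums


def recover(grid, row_sums, col_sums):
    """Rebuild lost blocks (None) in place; return True if all were recovered."""
    n_rows, n_cols = len(grid), len(grid[0])
    progress = True
    while progress:
        progress = False
        # A row with exactly one lost block can be repaired from its row sum.
        for i in range(n_rows):
            missing = [j for j in range(n_cols) if grid[i][j] is None]
            if len(missing) == 1:
                known = sum(v for v in grid[i] if v is not None)
                grid[i][missing[0]] = row_sums[i] - known
                progress = True
        # A column with exactly one lost block can be repaired from its column sum.
        for j in range(n_cols):
            missing = [i for i in range(n_rows) if grid[i][j] is None]
            if len(missing) == 1:
                known = sum(grid[i][j] for i in range(n_rows) if grid[i][j] is not None)
                grid[missing[0]][j] = col_sums[j] - known
                progress = True
    return all(v is not None for row in grid for v in row)


data = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
row_sums, col_sums = add_checksums(data)

# Two simultaneous faults in the same row: recoverable via the column sums.
damaged = [row[:] for row in data]
damaged[0][0] = None
damaged[0][1] = None
recovered_ok = recover(damaged, row_sums, col_sums)

# Four faults forming a rectangle: no row or column has a single loss.
stuck = [row[:] for row in data]
for i, j in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    stuck[i][j] = None
stuck_ok = recover(stuck, row_sums, col_sums)
```

In this simplified model the failure of each block is independent and the sums live outside the grid. The abstract's extensions concern exactly the cases this sketch ignores: processors holding arbitrary subsets of blocks, groups of processors that fail together, and parity blocks colocated with data blocks.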
Issue Title Special Issue: Computing Frontiers
Author Ali, Nawab (Pacific Northwest National Laboratory)
Krishnamoorthy, Sriram (Pacific Northwest National Laboratory; sriram@pnnl.gov)
Halappanavar, Mahantesh (Pacific Northwest National Laboratory)
Daily, Jeff (Pacific Northwest National Laboratory)
CODEN IJPPE5
ContentType Journal Article
Copyright Springer Science+Business Media New York 2012
Springer Science+Business Media New York 2013
Discipline Computer Science
EISSN 1573-7640
EndPage 493
IsPeerReviewed true
IsScholarly true
Issue 3
Keywords Data distribution
Fault tolerance
Checksums
Fault tolerant linear algebra
Language English
PageCount 25
PublicationDate 2013-06-01
PublicationPlace Boston
New York
PublicationTitle International journal of parallel programming
PublicationTitleAbbrev Int J Parallel Prog
PublicationYear 2013
Publisher Springer US
Springer Nature B.V
StartPage 469
SubjectTerms Algorithms
Blocking
Computer Science
Failure
Fault tolerance
Faults
Handles
Linear algebra
Parallel processing
Parity
Processor Architectures
Processors
Software Engineering/Programming and Operating Systems
Sparsity
Studies
Theory of Computation
Title Multi-Fault Tolerance for Cartesian Data Distributions
URI https://link.springer.com/article/10.1007/s10766-012-0218-5
https://www.proquest.com/docview/1288114474
https://www.proquest.com/docview/1323220780
Volume 41
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwfV3JTsMwEB11uXBhRxRKFSRuyKJxbMc9IFToJpYKAZXKKYpj54QaoO3_M5MmZTlw8sGRI82M7WeP5z2AM-cEF0Ecsg7dg4lU-ixGZzDfV1YY6Wtr6L7jYaxGE3E7ldMKjMtaGHpWWa6J-UJts4TuyC9wHdWI3UUort4_GKlGUXa1lNCIC2kFe5lTjFWhzokZqwb16_748embhjfMlShxakkWCqnLPOeqmC5UdLrmjPY9Jn9hzj9p0nz3GWzDZgEbve7KzztQcbNd2ColGbxihu6Bygtq2SDGxnvJ3hzpZjgPkal3Q683qWTS68WL2OsRY24hdjXfh8mg_3IzYoU0Asu4rxbMIqoxSnHdThKRIuZSbW5iY9rGIKKwCIoSYVzqpAoUEba0Qye0tdr6DiGh1MEB1GbZzB2CpxNuUgRthmwbJFIbnKJpB88Rzif2-gY0SzNERXzPo29vNOB03Y2RSemGeOayJX4TIFrjCEFwiPPSfD-GWPMlk90jtHtEdo_k0f8_PIYNvpKkQF81obb4XLoTBAYL04KqHgxbUO_2Hu6fqR2-3vVbRQxg74R3vwA0arhL
linkProvider ProQuest
linkToHtml http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwtV1LT8MwDLZ4HODCGzEYECQ4oYg1TdLsMCFgTOM1ITQkbqVp0hNagQ0h_hy_DbtreR24ceqhVSrZif3Zjv0B7HovhQyTiDcpDyYzFfAElcGDQDtpVWCcpXzHVU93b-X5nbqbgPeqF4auVVY2sTDULk8pR36AdtQgdpeRPHx84sQaRdXVikIjKakVXKsYMVY2dlz4t1cM4Yatszbqe0-Izmn_pMtLlgGei0CPuEOAYLUWppGmMkP4ohvCJtY2rEXn7BBfpNL6zCsdapp90oi8NM4ZF3hEV8qEuO4kTMtQNjH4mz4-7V3ffI39jQrmSzzKikdSmaquOm7eizRF84KTn-XqB8b9VZYtvF1nAeZKmMqOxvtqESb8YAnmKwoIVlqEZdBFAy_vJPhg_fzBE0-HZ4iE2QndFqUWTdZORglr04TeklxruAK3_yKkVZga5AO_BsykwmYIEi3pMkyVsWgSsibGLT6gafk1qFdiiMvzNIy_tF-Dnc_XeBKovJEMfP6C34SIDgVCHlxivxLftyU-5zOT3GOUe0xyj9X63z_chplu_-oyvjzrXWzArBjTYaDe6jA1en7xmwhKRnar1DyD-__ebB8pt--A
linkToPdf http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwtV1LT8MwDLbGkBAX3ojxDBKcUMSaJml2QAgxCuMlDpu0W2na9IRWYJsQf41fh92143HgxqmHVqlkO85nO_YHcOCcFNKPA96iPJjMlMdjVAb3PJ1KqzyTWsp33N3rq5687qt-DT6qXhi6Vln5xMJRp3lCOfJj9KMGsbsM5HFWXot4aIenzy-cGKSo0lrRaUxM5Ma9v2H4NjzptFHXh0KEF93zK14yDPBceHrEUwQHVmthmkkiM4QuuilsbG3TWjyYU8QWibQuc0r7muaeNAMnTZqa1HOIrJTxcd0ZmA1oijt1qYeXXwN_g4LzEjex4oFUpqqoTtr2Ak1xvOB0wnL1A93-KsgW51y4BAslQGVnE4tahpobrMBiRf7ASl-wCrpo3eVhjA_WzZ8cMXQ4hhiYndM9UWrOZO14FLM2zeYtabWGa9D7FxGtQ32QD9wGMJMImyE8tKRFP1HGojPIWhixOI_m5DdguxJDVO6kYfSl9wbsT1_jHqDCRjxw-Ri_8REXCgQ7uMRRJb5vS0wnM5PcI5R7RHKP1ObfP9yDOTSx6LZzf7MF82LCg4Fq24b66HXsdhCNjOxuoXYGj_9tZ5-OGe0a
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Multi-Fault+Tolerance+for+Cartesian+Data+Distributions&rft.jtitle=International+journal+of+parallel+programming&rft.au=Ali%2C+Nawab&rft.au=Krishnamoorthy%2C+Sriram&rft.au=Halappanavar%2C+Mahantesh&rft.au=Daily%2C+Jeff&rft.date=2013-06-01&rft.pub=Springer+Nature+B.V&rft.issn=0885-7458&rft.eissn=1573-7640&rft.volume=41&rft.issue=3&rft.spage=469&rft_id=info:doi/10.1007%2Fs10766-012-0218-5&rft.externalDBID=HAS_PDF_LINK&rft.externalDocID=2893352861