Multi-Fault Tolerance for Cartesian Data Distributions
Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than rep...
| Published in | International journal of parallel programming Vol. 41; no. 3; pp. 469 - 493 |
|---|---|
| Main Authors | Ali, Nawab; Krishnamoorthy, Sriram; Halappanavar, Mahantesh; Daily, Jeff |
| Format | Journal Article |
| Language | English |
| Published | Boston: Springer US; Springer Nature B.V (01.06.2013) |
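| Subjects | Algorithms; Blocking; Computer Science; Failure; Fault tolerance; Faults; Handles; Linear algebra; Parallel processing; Parity; Processor Architectures; Processors; Software Engineering/Programming and Operating Systems; Sparsity; Studies; Theory of Computation |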
| ISSN | 0885-7458 (print); 1573-7640 (electronic) |
| DOI | 10.1007/s10766-012-0218-5 |
| Abstract | Faults are expected to play an increasingly important role in how algorithms and applications are designed to run on future extreme-scale systems. Algorithm-based fault tolerance is a promising approach that involves modifications to the algorithm to recover from faults with lower overheads than replicated storage and a significant reduction in lost work compared to checkpoint-restart techniques. Fault-tolerant linear algebra algorithms employ additional processors that store parities along the dimensions of a matrix to tolerate multiple, simultaneous faults. Existing approaches assume regular data distributions (blocked or block-cyclic) with the failures of each data block being independent. To match the characteristics of failures on parallel computers, we extend these approaches to mapping parity blocks in several important ways. First, we handle parity computation for generalized Cartesian data distributions with each processor holding arbitrary subsets of blocks in a Cartesian-distributed array. Second, techniques to handle correlated failures, i.e., multiple processors that can be expected to fail together, are presented. Third, we handle the colocation of parity blocks with the data blocks and do not require them to be on additional processors. Several alternative approaches, based on graph matching, are presented that attempt to balance the memory overhead on processors while guaranteeing the same fault tolerance properties as existing approaches that assume independent failures on regular blocked data distributions. Evaluation of these algorithms demonstrates that the additional desirable properties are provided by the proposed approach with minimal overhead. |
|---|---|
| IssueTitle | Special Issue: Computing Frontiers |
| Author | Daily, Jeff; Halappanavar, Mahantesh; Krishnamoorthy, Sriram; Ali, Nawab |
| AuthorAffiliation | 1. Nawab Ali, Pacific Northwest National Laboratory; 2. Sriram Krishnamoorthy (sriram@pnnl.gov), Pacific Northwest National Laboratory; 3. Mahantesh Halappanavar, Pacific Northwest National Laboratory; 4. Jeff Daily, Pacific Northwest National Laboratory |
| CODEN | IJPPE5 |
| ContentType | Journal Article |
| Copyright | Springer Science+Business Media New York 2012; Springer Science+Business Media New York 2013 |
| Discipline | Computer Science |
| EISSN | 1573-7640 |
| EndPage | 493 |
| Genre | Feature |
| ISSN | 0885-7458 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 3 |
| Keywords | Data distribution; Fault tolerance; Checksums; Fault tolerant linear algebra |
| Language | English |
| PageCount | 25 |
| PublicationDateYYYYMMDD | 2013-06-01 |
| PublicationPlace | Boston; New York |
| PublicationTitle | International journal of parallel programming |
| PublicationTitleAbbrev | Int J Parallel Prog |
| PublicationYear | 2013 |
| Publisher | Springer US; Springer Nature B.V |
| StartPage | 469 |
| SubjectTerms | Algorithms; Blocking; Computer Science; Failure; Fault tolerance; Faults; Handles; Linear algebra; Parallel processing; Parity; Processor Architectures; Processors; Software Engineering/Programming and Operating Systems; Sparsity; Studies; Theory of Computation |
| Title | Multi-Fault Tolerance for Cartesian Data Distributions |
| URI | https://link.springer.com/article/10.1007/s10766-012-0218-5 https://www.proquest.com/docview/1288114474 https://www.proquest.com/docview/1323220780 |
| Volume | 41 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Multi-Fault+Tolerance+for+Cartesian+Data+Distributions&rft.jtitle=International+journal+of+parallel+programming&rft.au=Ali%2C+Nawab&rft.au=Krishnamoorthy%2C+Sriram&rft.au=Halappanavar%2C+Mahantesh&rft.au=Daily%2C+Jeff&rft.date=2013-06-01&rft.pub=Springer+Nature+B.V&rft.issn=0885-7458&rft.eissn=1573-7640&rft.volume=41&rft.issue=3&rft.spage=469&rft_id=info:doi/10.1007%2Fs10766-012-0218-5&rft.externalDBID=HAS_PDF_LINK&rft.externalDocID=2893352861 |