Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

Bibliographic Details
Published in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-12
Main Authors Dong, Xiangyu; Muralimanohar, Naveen; Jouppi, Norm; Kaufmann, Richard; Xie, Yuan
Format Conference Proceeding
Language English
Published New York, NY, USA: ACM, 14.11.2009
Series ACM Conferences
ISBN 1605587443
9781605587448
ISSN 2167-4329
DOI 10.1145/1654059.1654117

Abstract The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system.
Author Xie, Yuan
Muralimanohar, Naveen
Jouppi, Norm
Kaufmann, Richard
Dong, Xiangyu
Author_xml – sequence: 1
  givenname: Xiangyu
  surname: Dong
  fullname: Dong, Xiangyu
  organization: Pennsylvania State University
– sequence: 2
  givenname: Naveen
  surname: Muralimanohar
  fullname: Muralimanohar, Naveen
  organization: Hewlett-Packard Labs
– sequence: 3
  givenname: Norm
  surname: Jouppi
  fullname: Jouppi, Norm
  organization: Hewlett-Packard Labs
– sequence: 4
  givenname: Richard
  surname: Kaufmann
  fullname: Kaufmann, Richard
  organization: Hewlett-Packard Labs
– sequence: 5
  givenname: Yuan
  surname: Xie
  fullname: Xie, Yuan
  organization: Pennsylvania State University
CODEN IEEPAD
ContentType Conference Proceeding
Copyright 2009 ACM
Copyright_xml – notice: 2009 ACM
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1145/1654059.1654117
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList

Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 1605587443
9781605587448
EndPage 12
ExternalDocumentID 6375553
Genre orig-research
GroupedDBID 6IE
6IF
6IL
6IN
AAJGR
AARBI
ACM
ADPZR
ALMA_UNASSIGNED_HOLDINGS
APO
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
GUFHI
OCL
RIE
RIL
6IH
6IK
AAWTH
ABLEC
ADZIZ
CHZPO
IEGSK
IPLJI
ID FETCH-LOGICAL-a247t-2aa462e8e83c9a6cd0d305b823b644e092ceb5d77611fcc024fde3ed9aba7d313
IEDL.DBID RIE
ISBN 1605587443
9781605587448
ISSN 2167-4329
IngestDate Wed Jul 30 06:14:25 EDT 2025
Wed Jan 31 06:48:37 EST 2024
Wed Jan 31 06:45:55 EST 2024
IsPeerReviewed false
IsScholarly false
Language English
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
LinkModel DirectLink
MeetingName SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis
MergedId FETCHMERGED-LOGICAL-a247t-2aa462e8e83c9a6cd0d305b823b644e092ceb5d77611fcc024fde3ed9aba7d313
PageCount 12
ParticipantIDs ieee_primary_6375553
acm_books_10_1145_1654059_1654117_brief
acm_books_10_1145_1654059_1654117
PublicationCentury 2000
PublicationDate 20091114
2009-Nov.
PublicationDateYYYYMMDD 2009-11-14
2009-11-01
PublicationDate_xml – month: 11
  year: 2009
  text: 20091114
  day: 14
PublicationDecade 2000
PublicationPlace New York, NY, USA
PublicationPlace_xml – name: New York, NY, USA
PublicationSeriesTitle ACM Conferences
PublicationTitle Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis
PublicationTitleAbbrev SUPERC
PublicationYear 2009
Publisher ACM
Publisher_xml – name: ACM
SSID ssj0000812385
ssj0003204180
Score 1.7222223
SourceID ieee
acm
SourceType Publisher
StartPage 1
SubjectTerms Bandwidth
Checkpointing
Computing methodologies -- Parallel computing methodologies -- Parallel programming languages
Error analysis
File systems
General and reference -- Cross-computing tools and techniques -- Performance
Hardware
Hardware -- Hardware validation -- Functional verification -- Assertion checking
Phase change random access memory
Program processors
Software
Software and its engineering -- Software creation and management -- Software verification and validation
Software and its engineering -- Software creation and management -- Software verification and validation -- Operational analysis
Software and its engineering -- Software notations and tools -- General programming languages -- Language types -- Parallel programming languages
Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart
Theory of computation -- Semantics and reasoning -- Program reasoning -- Assertions
Three-dimensional displays
Transient analysis
Title Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
URI https://ieeexplore.ieee.org/document/6375553
hasFullText 1
inHoldings 1
isFullTextHit
isPrint