Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
| Published in | Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-12 |
|---|---|
| Main Authors | Dong, Xiangyu; Muralimanohar, Naveen; Jouppi, Norm; Kaufmann, Richard; Xie, Yuan |
| Format | Conference Proceeding | 
| Language | English | 
| Published | New York, NY, USA: ACM, 14.11.2009 |
| Series | ACM Conferences | 
| Subjects | Software and its engineering > Software creation and management > Software verification and validation; Software and its engineering > Software creation and management > Software verification and validation > Operational analysis; Software and its engineering > Software notations and tools > General programming languages > Language types > Parallel programming languages |
| ISBN | 1605587443 9781605587448  | 
| ISSN | 2167-4329 | 
| DOI | 10.1145/1654059.1654117 | 
| Abstract | The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system. |
| Author | Xie, Yuan; Muralimanohar, Naveen; Jouppi, Norm; Kaufmann, Richard; Dong, Xiangyu |
    
| Author_xml | 1. Dong, Xiangyu (Pennsylvania State University); 2. Muralimanohar, Naveen (Hewlett-Packard Labs); 3. Jouppi, Norm (Hewlett-Packard Labs); 4. Kaufmann, Richard (Hewlett-Packard Labs); 5. Xie, Yuan (Pennsylvania State University) |
    
| CODEN | IEEPAD | 
    
| ContentType | Conference Proceeding | 
    
| Copyright | 2009 ACM | 
    
| Discipline | Computer Science | 
    
| EISBN | 1605587443 9781605587448  | 
    
| EndPage | 12 | 
    
| ExternalDocumentID | 6375553 | 
    
| Genre | orig-research | 
    
| IsPeerReviewed | false | 
    
| IsScholarly | false | 
    
| License | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org | 
    
| MeetingName | SC '09: International Conference for High Performance Computing, Networking, Storage and Analysis | 
    
| PageCount | 12 | 
    
| PublicationCentury | 2000 | 
    
| PublicationDate | 20091114 2009-Nov.  | 
    
| PublicationDateYYYYMMDD | 2009-11-14 2009-11-01  | 
    
| PublicationDate_xml | – month: 11 year: 2009 text: 20091114 day: 14  | 
    
| PublicationDecade | 2000 | 
    
| PublicationTitleAbbrev | SUPERC | 
    
| PublicationYear | 2009 | 
    
| StartPage | 1 | 
    
| SubjectTerms | Bandwidth; Checkpointing; Computing methodologies -- Parallel computing methodologies -- Parallel programming languages; Error analysis; File systems; General and reference -- Cross-computing tools and techniques -- Performance; Hardware; Hardware -- Hardware validation -- Functional verification -- Assertion checking; Phase change random access memory; Program processors; Software; Software and its engineering -- Software creation and management -- Software verification and validation; Software and its engineering -- Software creation and management -- Software verification and validation -- Operational analysis; Software and its engineering -- Software notations and tools -- General programming languages -- Language types -- Parallel programming languages; Software and its engineering -- Software organization and properties -- Extra-functional properties -- Software fault tolerance -- Checkpoint -- restart; Theory of computation -- Semantics and reasoning -- Program reasoning -- Assertions; Three-dimensional displays; Transient analysis |
    
| URI | https://ieeexplore.ieee.org/document/6375553 | 
    