Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems

The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that...

Full description

Saved in:
Bibliographic Details
Published inProceedings of the Conference on High Performance Computing Networking, Storage and Analysis pp. 1 - 12
Main Authors Dong, Xiangyu, Muralimanohar, Naveen, Jouppi, Norm, Kaufmann, Richard, Xie, Yuan
Format Conference Proceeding
LanguageEnglish
Published New York, NY, USA ACM 14.11.2009
SeriesACM Conferences
Subjects
Online AccessGet full text
ISBN1605587443
9781605587448
ISSN2167-4329
DOI10.1145/1654059.1654117

Cover

More Information
Summary:The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system.
ISBN:1605587443
9781605587448
ISSN:2167-4329
DOI:10.1145/1654059.1654117