Leveraging 3D PCRAM technologies to reduce checkpoint overhead for future exascale systems
| Published in | Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-12 |
|---|---|
| Main Authors | , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | New York, NY, USA: ACM, 14.11.2009 |
| Series | ACM Conferences |
| Subjects | Software and its engineering > Software creation and management > Software verification and validation; Software and its engineering > Software creation and management > Software verification and validation > Operational analysis; Software and its engineering > Software notations and tools > General programming languages > Language types > Parallel programming languages |
| ISBN | 1605587443 9781605587448 |
| ISSN | 2167-4329 |
| DOI | 10.1145/1654059.1654117 |
| Summary | The scalability of future massively parallel processing (MPP) systems is challenged by high failure rates. Current hard disk drive (HDD) checkpointing results in overhead of 25% or more at the petascale. With a direct correlation between checkpoint frequencies and node counts, novel techniques that can take more frequent checkpoints with minimum overhead are critical to implement a reliable exascale system. In this work, we leverage the upcoming Phase-Change Random Access Memory (PCRAM) technology and propose a hybrid local/global checkpointing mechanism after a thorough analysis of MPP systems failure rates and failure sources. We propose three variants of PCRAM-based hybrid checkpointing schemes, DIMM+HDD, DIMM+DIMM, and 3D+3D, to reduce the checkpoint overhead and offer a smooth transition from the conventional pure HDD checkpoint to the ideal 3D PCRAM mechanism. The proposed pure 3D PCRAM-based mechanism can ultimately take checkpoints with overhead less than 4% on a projected exascale system. |