FMI: Fault Tolerant Messaging Interface for Fast and Transparent Recovery

Bibliographic Details
Published in: Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1225-1234
Main Authors: Sato, Kento; Moody, Adam; Mohror, Kathryn; Gamblin, Todd; de Supinski, Bronis R.; Maruyama, Naoya; Matsuoka, Satoshi
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2014
ISBN: 1479937991, 9781479937998
ISSN: 1530-2075
DOI: 10.1109/IPDPS.2014.126

Summary: Future supercomputers built with more components will enable larger, higher-fidelity simulations, but at the cost of higher failure rates. Traditional approaches to mitigating failures, such as checkpoint/restart (C/R) to a parallel file system, incur large overheads. On future extreme-scale systems, it is unlikely that traditional C/R will recover a failed application before the next failure occurs. To address this problem, we present the Fault Tolerant Messaging Interface (FMI), which enables extremely low-latency recovery. FMI accomplishes this using a survivable communication runtime coupled with fast, in-memory C/R and dynamic node allocation. FMI provides message-passing semantics similar to MPI, but applications written using FMI can run through failures. The FMI runtime software handles fault tolerance, including checkpointing application state, restarting failed processes, and allocating additional nodes when needed. Our tests show that FMI's failure-free performance is similar to that of MPI, and that FMI incurs only a 28% overhead even at a very high failure rate, with a mean time between failures of 1 minute.
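
The idea of running through failures with in-memory C/R can be illustrated with a toy model. The sketch below is not FMI's API and is not taken from the paper: the function `run_with_rollback`, its parameters, and the failure model (failures injected at fixed execution counts) are all hypothetical. It only shows how rolling back to an in-memory checkpoint turns a failure into a bounded amount of redone work, where a real runtime would also restart processes and allocate replacement nodes.

```python
import copy


def run_with_rollback(num_steps, checkpoint_interval, fail_at):
    """Toy iterative computation with in-memory checkpoint and rollback.

    num_steps: number of application steps to complete.
    checkpoint_interval: take a checkpoint every this many completed steps.
    fail_at: execution counts at which a failure is injected (hypothetical
             failure model, chosen only to keep the example deterministic).

    Returns (final_value, executed), where executed counts every step
    execution, including work redone after a rollback.
    """
    state = {"step": 0, "value": 0}
    checkpoint = copy.deepcopy(state)  # kept in memory, no file system involved
    pending_failures = set(fail_at)
    executed = 0

    while state["step"] < num_steps:
        if executed in pending_failures:
            # Simulated failure: discard current state, restore the checkpoint.
            pending_failures.discard(executed)
            state = copy.deepcopy(checkpoint)
            continue

        # One unit of application work.
        state["value"] += state["step"]
        state["step"] += 1
        executed += 1

        if state["step"] % checkpoint_interval == 0:
            checkpoint = copy.deepcopy(state)

    return state["value"], executed


if __name__ == "__main__":
    value, executed = run_with_rollback(10, 3, {5})
    overhead = (executed - 10) / 10
    print(value, executed, f"{overhead:.0%}")
```

In this toy run a single failure after five executed steps forces a rollback to the checkpoint taken at step 3, so two steps are redone and the result is unchanged. The overhead here counts redone work only; it ignores the cost of taking checkpoints, which a real measurement would include.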