Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs
| Published in | 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 162-169 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.05.2011 |
| ISBN | 9781612842776 1612842771 |
| DOI | 10.1109/FCCM.2011.22 |
| Summary: | As FPGA resources continue to increase, FPGAs present attractive features to the High Performance Computing community. These include power-efficient computation and application-specific acceleration, as well as tighter integration between compute and I/O resources. This paper considers the ability of an FPGA to address another increasingly important feature: resiliency. Specifically, a minimally invasive monitoring infrastructure operating over a sideband network is presented. This includes a multi-chip protocol, IP cores that implement the protocol, and a tool to instrument existing hardware accelerator FPGA designs. To demonstrate the functionality, the system has been implemented on a cluster of FPGA devices running off-the-shelf MPI and Linux. We demonstrate integrated software and hardware accelerator checkpointing with restart under a variety of injected faults. |
|---|---|