Checkpoint/Restart and Beyond: Resilient High Performance Computing with FPGAs

Bibliographic Details
Published in 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines, pp. 162-169
Main Authors Schmidt, A G; Bin Huang; Sass, R; French, M
Format Conference Proceeding
Language English
Published IEEE 01.05.2011
ISBN 9781612842776
1612842771
DOI 10.1109/FCCM.2011.22

Abstract As FPGA resources continue to increase, FPGAs offer features that are attractive to the High Performance Computing community: power-efficient computation, application-specific acceleration, and tighter integration between compute and I/O resources. This paper considers the ability of an FPGA to address another increasingly important feature: resiliency. Specifically, it presents a minimally invasive monitoring infrastructure that operates over a sideband network. The infrastructure comprises a multi-chip protocol, IP cores that implement the protocol, and a tool to instrument existing FPGA hardware accelerator designs. To demonstrate its functionality, the system has been implemented on a cluster of FPGA devices running off-the-shelf MPI and Linux. We demonstrate integrated software and hardware accelerator checkpointing with restart under a variety of injected faults.
Author_xml – sequence: 1
  givenname: A G
  surname: Schmidt
  fullname: Schmidt, A G
  email: andrewgschmidt@gmail.com
  organization: Reconfigurable Comput. Syst. Lab., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
– sequence: 2
  surname: Bin Huang
  fullname: Bin Huang
  email: bhuang2@uncc.edu
  organization: Reconfigurable Comput. Syst. Lab., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
– sequence: 3
  givenname: R
  surname: Sass
  fullname: Sass, R
  email: rsass@uncc.edu
  organization: Reconfigurable Comput. Syst. Lab., Univ. of North Carolina at Charlotte, Charlotte, NC, USA
– sequence: 4
  givenname: M
  surname: French
  fullname: French, M
  email: mfrench@isi.edu
  organization: Inf. Sci. Inst., Univ. of Southern California, Arlington, VA, USA
EISBN 0769543014
9780769543017
PageCount 8
SubjectTerms Amplitude modulation
Checkpoint Restart
Context
Field programmable gate arrays
FPGA
Hardware
High Performance Computing
Monitoring
Reconfigurable Computing
Registers
Resiliency
Software
URI https://ieeexplore.ieee.org/document/5771268