FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Bibliographic Details
Published in IEEE Transactions on Parallel and Distributed Systems, Vol. 32, No. 7, pp. 1677-1689
Main Authors Zhao, Kai, Di, Sheng, Li, Sihuan, Liang, Xin, Zhai, Yujia, Chen, Jieyang, Ouyang, Kaiming, Cappello, Franck, Chen, Zizhong
Format Journal Article
Language English
Published New York IEEE 01.07.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
ISSN 1045-9219
EISSN 1558-2183
DOI 10.1109/TPDS.2020.3043449

Abstract Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
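The filter-checksum idea behind such ABFT schemes rests on the linearity of convolution: convolving the input with the elementwise sum of all filters must equal the elementwise sum of the per-filter outputs, so a single extra convolution yields a verifiable checksum regardless of how the individual convolutions are implemented. A minimal single-channel sketch in Python/NumPy (function names and the tolerance are illustrative assumptions, not the article's implementation):

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation of a single-channel input with one filter."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def checksum_ok(x, kernels, outputs, tol=1e-6):
    """Filter-checksum ABFT test: by linearity of convolution,
    conv(x, sum of kernels) must equal the elementwise sum of the
    per-kernel outputs; any mismatch flags a soft error."""
    checksum = conv2d(x, np.sum(kernels, axis=0))  # one extra convolution
    return float(np.abs(checksum - np.sum(outputs, axis=0)).max()) < tol
```

Detecting a corrupted output feature map is then a single comparison against the checksum; locating and correcting the faulty element requires the additional checksum schemes and the combined workflow described in the article.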
Author_xml – sequence: 1
  givenname: Kai
  orcidid: 0000-0001-5328-3962
  surname: Zhao
  fullname: Zhao, Kai
  email: kzhao016@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 2
  givenname: Sheng
  orcidid: 0000-0002-7339-5256
  surname: Di
  fullname: Di, Sheng
  email: sdi1@anl.gov
  organization: Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA
– sequence: 3
  givenname: Sihuan
  orcidid: 0000-0001-7315-7955
  surname: Li
  fullname: Li, Sihuan
  email: sli049@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 4
  givenname: Xin
  orcidid: 0000-0002-0630-1600
  surname: Liang
  fullname: Liang, Xin
  email: liangx@ornl.gov
  organization: Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA
– sequence: 5
  givenname: Yujia
  surname: Zhai
  fullname: Zhai, Yujia
  email: yzhai015@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 6
  givenname: Jieyang
  orcidid: 0000-0002-1905-9171
  surname: Chen
  fullname: Chen, Jieyang
  email: chen@ucr.edu
  organization: Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA
– sequence: 7
  givenname: Kaiming
  surname: Ouyang
  fullname: Ouyang, Kaiming
  email: kouya001@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 8
  givenname: Franck
  surname: Cappello
  fullname: Cappello, Franck
  email: cappello@mcs.anl.gov
  organization: Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA
– sequence: 9
  givenname: Zizhong
  surname: Chen
  fullname: Chen, Zizhong
  email: chen@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
CODEN ITDSEO
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
DOI 10.1109/TPDS.2020.3043449
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Discipline Engineering
Computer Science
EISSN 1558-2183
EndPage 1689
ExternalDocumentID 9311863
Genre orig-research
GrantInformation_xml – fundername: Exascale Computing Project
  grantid: 17-SC-20-SC
– fundername: National Science Foundation
  grantid: CCF-1513201; CCF-1619253; OAC-2034169
  funderid: 10.13039/100000001
– fundername: U.S. Department of Energy
  grantid: DE-AC02-06CH11357
  funderid: 10.13039/100000015
– fundername: National Nuclear Security Administration
  funderid: 10.13039/100006168
ISSN 1045-9219
IsPeerReviewed true
IsScholarly true
Issue 7
Language English
ORCID 0000-0001-5328-3962
0000-0002-1905-9171
0000-0002-7339-5256
0000-0002-0630-1600
0000-0001-7315-7955
PQID 2492859674
PQPubID 85437
PageCount 13
PublicationDate 2021-07-01
PublicationPlace New York
PublicationTitle IEEE transactions on parallel and distributed systems
PublicationTitleAbbrev TPDS
PublicationYear 2021
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
StartPage 1677
SubjectTerms Algorithm-based fault tolerance
Algorithms
Artificial neural networks
Convolution
deep learning
Error correcting codes
Error correction
Error correction codes
Fault tolerance
Fault tolerant systems
High temperature
high-performance computing
Inference
Kernel
Mathematical model
Multiplication
Neural networks
reliability
Run time (computers)
Runtime
Safety critical
silent data corruption
Soft errors
Workflow
Title FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
URI https://ieeexplore.ieee.org/document/9311863
https://www.proquest.com/docview/2492859674
Volume 32