FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks

Bibliographic Details
Published in IEEE Transactions on Parallel and Distributed Systems, Vol. 32, No. 7, pp. 1677-1689
Main Authors Zhao, Kai, Di, Sheng, Li, Sihuan, Liang, Xin, Zhai, Yujia, Chen, Jieyang, Ouyang, Kaiming, Cappello, Franck, Chen, Zizhong
Format Journal Article
Language English
Published New York IEEE 01.07.2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
ISSN 1045-9219
EISSN 1558-2183
DOI 10.1109/TPDS.2020.3043449

Abstract Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations).
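The filter-checksum idea behind such ABFT schemes rests on the linearity of convolution: convolving the input with the elementwise sum of all filters must equal the elementwise sum of the per-filter outputs, so a single extra convolution yields a verifiable checksum regardless of how the individual convolutions are implemented. A minimal single-channel sketch in Python/NumPy (function names and the tolerance are illustrative assumptions, not the article's implementation):

```python
import numpy as np

def conv2d(x, k):
    """Valid-mode 2D cross-correlation of a single-channel input with one filter."""
    kh, kw = k.shape
    oh, ow = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(x[i:i + kh, j:j + kw] * k)
    return out

def checksum_ok(x, kernels, outputs, tol=1e-6):
    """Filter-checksum ABFT test: by linearity of convolution,
    conv(x, sum of kernels) must equal the elementwise sum of the
    per-kernel outputs; any mismatch flags a soft error."""
    checksum = conv2d(x, np.sum(kernels, axis=0))  # one extra convolution
    return float(np.abs(checksum - np.sum(outputs, axis=0)).max()) < tol
```

Detecting a corrupted output feature map is then a single comparison against the checksum; locating and correcting the faulty element requires the additional checksum schemes and the combined workflow described in the article.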
Author_xml – sequence: 1
  givenname: Kai
  orcidid: 0000-0001-5328-3962
  surname: Zhao
  fullname: Zhao, Kai
  email: kzhao016@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 2
  givenname: Sheng
  orcidid: 0000-0002-7339-5256
  surname: Di
  fullname: Di, Sheng
  email: sdi1@anl.gov
  organization: Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA
– sequence: 3
  givenname: Sihuan
  orcidid: 0000-0001-7315-7955
  surname: Li
  fullname: Li, Sihuan
  email: sli049@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 4
  givenname: Xin
  orcidid: 0000-0002-0630-1600
  surname: Liang
  fullname: Liang, Xin
  email: liangx@ornl.gov
  organization: Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA
– sequence: 5
  givenname: Yujia
  surname: Zhai
  fullname: Zhai, Yujia
  email: yzhai015@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 6
  givenname: Jieyang
  orcidid: 0000-0002-1905-9171
  surname: Chen
  fullname: Chen, Jieyang
  email: chen@ucr.edu
  organization: Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA
– sequence: 7
  givenname: Kaiming
  surname: Ouyang
  fullname: Ouyang, Kaiming
  email: kouya001@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– sequence: 8
  givenname: Franck
  surname: Cappello
  fullname: Cappello, Franck
  email: cappello@mcs.anl.gov
  organization: Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA
– sequence: 9
  givenname: Zizhong
  surname: Chen
  fullname: Chen, Zizhong
  email: chen@ucr.edu
  organization: Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
CODEN ITDSEO
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021
DOI 10.1109/TPDS.2020.3043449
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
Discipline Engineering
Computer Science
EISSN 1558-2183
EndPage 1689
ExternalDocumentID 9311863
Genre orig-research
GrantInformation_xml – fundername: Exascale Computing Project
  grantid: 17-SC-20-SC
– fundername: National Science Foundation
  grantid: CCF-1513201; CCF-1619253; OAC-2034169
  funderid: 10.13039/100000001
– fundername: U.S. Department of Energy
  grantid: DE-AC02-06CH11357
  funderid: 10.13039/100000015
– fundername: National Nuclear Security Administration
  funderid: 10.13039/100006168
ISSN 1045-9219
IsPeerReviewed true
IsScholarly true
Issue 7
Language English
ORCID 0000-0001-5328-3962
0000-0002-1905-9171
0000-0002-7339-5256
0000-0002-0630-1600
0000-0001-7315-7955
PQID 2492859674
PQPubID 85437
PageCount 13
PublicationDate 2021-07-01
PublicationPlace New York
PublicationTitle IEEE transactions on parallel and distributed systems
PublicationTitleAbbrev TPDS
PublicationYear 2021
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
StartPage 1677
SubjectTerms Algorithm-based fault tolerance
Algorithms
Artificial neural networks
Convolution
deep learning
Error correcting codes
Error correction
Error correction codes
Fault tolerance
Fault tolerant systems
High temperature
high-performance computing
Inference
Kernel
Mathematical model
Multiplication
Neural networks
reliability
Run time (computers)
Runtime
Safety critical
silent data corruption
Soft errors
Workflow
Title FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
URI https://ieeexplore.ieee.org/document/9311863
https://www.proquest.com/docview/2492859674
Volume 32