FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks
Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage.
| Published in | IEEE transactions on parallel and distributed systems Vol. 32; no. 7; pp. 1677 - 1689 |
|---|---|
| Main Authors | Zhao, Kai; Di, Sheng; Li, Sihuan; Liang, Xin; Zhai, Yujia; Chen, Jieyang; Ouyang, Kaiming; Cappello, Franck; Chen, Zizhong |
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.07.2021 |
| Subjects | |
| Online Access | Get full text |
| ISSN | 1045-9219 (print); 1558-2183 (electronic) |
| DOI | 10.1109/TPDS.2020.3043449 |
| Abstract | Convolutional neural networks (CNNs) are becoming more and more important for solving challenging and critical problems in many fields. CNN inference applications have been deployed in safety-critical systems, which may suffer from soft errors caused by high-energy particles, high temperature, or abnormal voltage. Of critical importance is ensuring the stability of the CNN inference process against soft errors. Traditional fault tolerance methods are not suitable for CNN inference because error-correcting code is unable to protect computational components, instruction duplication techniques incur high overhead, and existing algorithm-based fault tolerance (ABFT) techniques cannot protect all convolution implementations. In this article, we focus on how to protect the CNN inference process against soft errors as efficiently as possible, with the following three contributions. (1) We propose several systematic ABFT schemes based on checksum techniques and analyze their fault protection ability and runtime thoroughly. Unlike traditional ABFT based on matrix-matrix multiplication, our schemes support any convolution implementations. (2) We design a novel workflow integrating all the proposed schemes to obtain a high detection/correction ability with limited total runtime overhead. (3) We perform our evaluation using ImageNet with well-known CNN models including AlexNet, VGG-19, ResNet-18, and YOLOv2. Experimental results demonstrate that our implementation can handle soft errors with very limited runtime overhead (4%~8% in both error-free and error-injected situations). |
|---|---|
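The abstract's central idea, checksum-based ABFT that works with any convolution implementation, rests on the linearity of convolution: the sum of the per-filter outputs must equal the output of a single convolution with the summed (checksum) filter. The sketch below illustrates that detection idea in plain NumPy. It is an assumption-laden illustration only: the single-input-channel setup, the direct convolution, and names such as `abft_check` are invented for this example and are not the paper's actual FT-CNN schemes or workflow.

```python
# Minimal sketch of filter-checksum fault detection for a convolutional layer.
# Assumptions: one input channel, direct "valid" convolution, float64 arithmetic.
import numpy as np

def conv2d(x, w):
    """Valid 2D cross-correlation of a single-channel image x with kernel w."""
    H, W = x.shape
    kH, kW = w.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(x[i:i + kH, j:j + kW] * w)
    return out

def conv_layer(x, filters):
    """Apply each filter to the input, producing one feature map per filter."""
    return np.stack([conv2d(x, w) for w in filters])

def abft_check(x, filters, outputs, tol=1e-6):
    """Checksum test: by linearity, sum_k conv(x, w_k) == conv(x, sum_k w_k).
    A mismatch indicates that some output feature map was corrupted."""
    checksum_filter = filters.sum(axis=0)       # checksum (summed) kernel
    expected_sum = conv2d(x, checksum_filter)   # checksum of the outputs
    actual_sum = outputs.sum(axis=0)            # recomputed from the outputs
    return np.max(np.abs(expected_sum - actual_sum)) < tol

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))
filters = rng.standard_normal((4, 3, 3))

outputs = conv_layer(x, filters)
print("clean pass passes checksum test:", abft_check(x, filters, outputs))

outputs[2, 1, 1] += 10.0   # inject a soft error into one feature map
print("corrupted pass passes checksum test:", abft_check(x, filters, outputs))
```

Running this prints True for the clean pass and False after the injected error, mirroring only the detection step; the paper's schemes go further by combining several checksum variants in a workflow to also locate and correct corrupted values at low overhead.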
| Author | Liang, Xin; Ouyang, Kaiming; Li, Sihuan; Chen, Zizhong; Zhao, Kai; Di, Sheng; Cappello, Franck; Zhai, Yujia; Chen, Jieyang |
| Author_xml | – 1. Kai Zhao (kzhao016@ucr.edu), Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA; ORCID 0000-0001-5328-3962
– 2. Sheng Di (sdi1@anl.gov), Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA; ORCID 0000-0002-7339-5256
– 3. Sihuan Li (sli049@ucr.edu), Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA; ORCID 0000-0001-7315-7955
– 4. Xin Liang (liangx@ornl.gov), Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA; ORCID 0000-0002-0630-1600
– 5. Yujia Zhai (yzhai015@ucr.edu), Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– 6. Jieyang Chen (chen@ucr.edu), Oak Ridge National Laboratory, Computer Science and Mathematics Division, Oak Ridge, TN, USA; ORCID 0000-0002-1905-9171
– 7. Kaiming Ouyang (kouya001@ucr.edu), Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA
– 8. Franck Cappello (cappello@mcs.anl.gov), Argonne National Laboratory, Mathematics and Computer Science Division, Lemont, IL, USA
– 9. Zizhong Chen (chen@ucr.edu), Department of Computer Science and Engineering, University of California, Riverside, Riverside, CA, USA |
| CODEN | ITDSEO |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2021 |
| DOI | 10.1109/TPDS.2020.3043449 |
| DatabaseName | IEEE All-Society Periodicals Package (ASPP) 2005–Present IEEE All-Society Periodicals Package (ASPP) 1998–Present IEEE Electronic Library (IEL) Computer and Information Systems Abstracts Electronics & Communications Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
| DatabaseTitle | Technology Research Database Computer and Information Systems Abstracts – Academic Electronics & Communications Abstracts ProQuest Computer Science Collection Computer and Information Systems Abstracts Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Professional |
| DatabaseTitleList | Technology Research Database |
| Discipline | Engineering Computer Science |
| EISSN | 1558-2183 |
| EndPage | 1689 |
| ExternalDocumentID | 9311863 |
| Genre | orig-research |
| GrantInformation_xml | – Exascale Computing Project, grant 17-SC-20-SC
– National Science Foundation (10.13039/100000001), grants CCF-1513201, CCF-1619253, OAC-2034169
– U.S. Department of Energy (10.13039/100000015), grant DE-AC02-06CH11357
– National Nuclear Security Administration (10.13039/100006168) |
| ISSN | 1045-9219 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Language | English |
| ORCID | 0000-0001-5328-3962 0000-0002-1905-9171 0000-0002-7339-5256 0000-0002-0630-1600 0000-0001-7315-7955 |
| PQID | 2492859674 |
| PQPubID | 85437 |
| PageCount | 13 |
| ParticipantIDs | ieee_primary_9311863 proquest_journals_2492859674 |
| PublicationCentury | 2000 |
| PublicationDate | 2021-07-01 |
| PublicationDecade | 2020 |
| PublicationPlace | New York |
| PublicationTitle | IEEE transactions on parallel and distributed systems |
| PublicationTitleAbbrev | TPDS |
| PublicationYear | 2021 |
| Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 1677 |
| SubjectTerms | Algorithm-based fault tolerance; Algorithms; Artificial neural networks; Convolution; deep learning; Error correcting codes; Error correction; Error correction codes; Fault tolerance; Fault tolerant systems; High temperature; high-performance computing; Inference; Kernel; Mathematical model; Multiplication; Neural networks; reliability; Run time (computers); Runtime; Safety critical; silent data corruption; Soft errors; Workflow |
| Title | FT-CNN: Algorithm-Based Fault Tolerance for Convolutional Neural Networks |
| URI | https://ieeexplore.ieee.org/document/9311863 https://www.proquest.com/docview/2492859674 |
| Volume | 32 |