Analyzing and Increasing the Reliability of Convolutional Neural Networks on GPUs

Bibliographic Details
Published in	IEEE Transactions on Reliability, Vol. 68, no. 2, pp. 663-677
Main Authors	Santos, Fernando Fernandes dos; Pimenta, Pedro Foletto; Lunardi, Caio; Draghetti, Lucas; Carro, Luigi; Kaeli, David; Rech, Paolo
Format	Journal Article
Language	English
Published	New York: IEEE, 01.06.2019; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithm-based fault tolerance (ABFT); Algorithms; Artificial neural networks; convolutional neural networks (CNNs); embedded systems; Error correcting codes; Error correction; Error correction codes; Fault tolerance; Fault tolerant systems; Graphics processing units; Hardware; Image detection; Multiplication; Network reliability; Neural networks; Neutron beams; Object recognition; reliability; Reliability analysis; Safety critical; soft errors
ISSN	0018-9529 1558-1721
DOI	10.1109/TR.2018.2878387

Summary:	Graphics processing units (GPUs) play a critical role in running convolutional neural networks (CNNs) for image detection. As GPU-enabled CNNs move into safety-critical environments, reliability is a growing concern. In this paper, we evaluate the reliability of object detection algorithms run on three NVIDIA GPU architectures and propose strategies to improve it. We consider three algorithms: 1) You Only Look Once (YOLO); 2) Faster Region-based CNN (Faster R-CNN); and 3) a Residual Network (ResNet). We expose live hardware to neutron beams and complement the beam experiments with fault injection to better characterize fault propagation in CNNs. We show that a single fault occurring in a GPU tends to propagate to multiple active threads, significantly reducing the reliability of a CNN. Moreover, relying on error-correcting codes dramatically reduces the number of silent data corruptions (SDCs) but does not reduce the number of critical errors (i.e., errors that could potentially impact safety-critical applications). Based on observations of how faults propagate on GPU architectures, we propose effective strategies to improve CNN reliability. We also consider the benefits of an algorithm-based fault-tolerance (ABFT) technique for matrix multiplication, which can correct more than 87% of the critical SDCs in a CNN, and of redesigning the maxpool layers of the CNN, which can detect up to 98% of critical SDCs.
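
Note on ABFT for matrix multiplication: the summary names the technique without describing it. The sketch below illustrates the classic checksum-based form of ABFT in general terms; it is not the authors' GPU implementation, and it assumes integer matrices so that checksums compare exactly (floating-point data would need a tolerance). The names `matmul`, `abft_matmul`, and the `inject` parameter are hypothetical and exist only for this illustration.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def abft_matmul(a, b, inject=None):
    """Multiply a (m x k) by b (k x n) with checksum protection.

    inject is a hypothetical (row, col, delta) fault added to the raw
    product to imitate a soft error. Returns (c, corrected), where
    corrected is the (row, col) that was repaired, or None.
    """
    m, n = len(a), len(b[0])
    # Column-checksum A: extra row holding the sum of each column of a.
    a_c = a + [[sum(col) for col in zip(*a)]]
    # Row-checksum B: extra column holding the sum of each row of b.
    b_r = [row + [sum(row)] for row in b]
    # Full-checksum product, (m+1) x (n+1): the last column carries the
    # row sums of the result and the last row carries its column sums.
    c_f = matmul(a_c, b_r)

    if inject is not None:
        i, j, delta = inject
        c_f[i][j] += delta  # simulated corruption of one output cell

    # Recompute the sums of the data part and compare with the checksums.
    bad_rows = [i for i in range(m) if sum(c_f[i][:n]) != c_f[i][n]]
    bad_cols = [j for j in range(n)
                if sum(c_f[i][j] for i in range(m)) != c_f[m][j]]

    corrected = None
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        # One bad row and one bad column locate a single corrupted cell;
        # the row checksum gives the amount needed to restore it.
        i, j = bad_rows[0], bad_cols[0]
        c_f[i][j] += c_f[i][n] - sum(c_f[i][:n])
        corrected = (i, j)
    elif bad_rows or bad_cols:
        # Multiple faults, or a fault in a checksum cell: detected only.
        raise ValueError("error detected but not correctable by this sketch")

    return [row[:n] for row in c_f[:m]], corrected


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(abft_matmul(a, b))                       # fault-free run
    print(abft_matmul(a, b, inject=(1, 0, 1000)))  # one corrupted cell
```

In this sketch a single corrupted output element is both located and repaired, while any other mismatch pattern is only reported. That limitation matters in light of the summary's observation that a single GPU fault tends to spread to multiple threads.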