WRA: A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks

Bibliographic Details
Published in	IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, no. 9, pp. 3480-3493
Main Authors	Yang, Chen; Wang, Yizhou; Wang, Xiaoli; Geng, Li
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.09.2019
Subjects	Acceleration; Algorithms; Artificial neural networks; Computer architecture; Convolution; convolution decomposition; Convolutional neural network; data reuse; Decomposition; Energy efficiency; Hardware; Microprocessors; Neural networks; Parallel processing; Power efficiency; reconfigurable; Reconfiguration; Shape; Throughput; Winograd
ISSN	1549-8328 1558-0806
DOI	10.1109/TCSI.2019.2928682

Summary:	As convolutional neural networks (CNNs) become increasingly diverse and complex, accelerating them runs into the difficulty of balancing performance, energy efficiency, and flexibility in a unified architecture. This paper proposes a Winograd-based, highly efficient, and dynamically Reconfigurable Accelerator (named WRA) for quickly evolving CNN models. A cost-effective convolution decomposition method (CDW) is proposed that extends the applicability of the fast Winograd algorithm. Based on CDW, a high-throughput, reconfigurable processing element (PE) array is designed to exploit the parallelism of Winograd. In addition, a highly compact memory structure employs four levels of data reuse to maximize data reuse and minimize the external bandwidth requirement. With its dynamic reconfigurability, WRA implements CDW and other convolutions (e.g., standard convolution, depthwise separable convolution, and group convolution) on a unified hardware architecture. The WRA accelerator was implemented on a Xilinx XCVU9P platform running at a 330 MHz clock frequency, controlled by a POWER8 processor via a coherent accelerator processor interface (CAPI). Depending on the configuration, WRA provides 2.2-6.3 TOPS for different convolution shapes. The average performance and energy efficiency for VGG16, AlexNet, MobileNetV1, and MobileNetV2 are 5288 GOP/s at 151.2 GOPs/W, 3478 GOP/s at 99.4 GOPs/W, 2674 GOP/s at 76.4 GOPs/W, and 2194 GOP/s at 62.7 GOPs/W, respectively. It achieves a 1.7× to 24× speedup compared with previous FPGA-based designs.
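Note:	The summary refers to the fast Winograd algorithm without spelling it out. As background only, the sketch below shows the standard 1-D Winograd minimal filtering case F(2,3), which produces 2 outputs of a 3-tap filter with 4 multiplications instead of 6. It is not the paper's CDW decomposition or its PE array; the function names `direct_fir` and `winograd_f23` are illustrative assumptions, not taken from the record.

```python
def direct_fir(d, g):
    """Direct 1-D sliding-window filter: 2 outputs from 4 inputs, 6 multiplies."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


def winograd_f23(d, g):
    """Standard Winograd F(2,3): the same 2 outputs using 4 multiplies.

    d: 4 input samples, g: 3 filter taps. Textbook algorithm only,
    not the CDW method described in the summary.
    """
    # Filter transform (can be precomputed once per filter).
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform: additions and subtractions only.
    v0 = d[0] - d[2]
    v1 = d[1] + d[2]
    v2 = d[2] - d[1]
    v3 = d[1] - d[3]

    # Element-wise products: the only 4 general multiplications.
    m0, m1, m2, m3 = u0 * v0, u1 * v1, u2 * v2, u3 * v3

    # Output transform: additions and subtractions only.
    return [m0 + m1 + m2, m1 - m2 - m3]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [1.0, 0.0, -1.0]
    print(direct_fir(d, g))
    print(winograd_f23(d, g))
```

Both functions should agree up to floating-point rounding; the saving comes from trading multiplications for cheaper additions, which is the property a Winograd-based accelerator exploits.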