WRA: A 2.2-to-6.3 TOPS Highly Unified Dynamically Reconfigurable Accelerator Using a Novel Winograd Decomposition Algorithm for Convolutional Neural Networks

Bibliographic Details
Published in IEEE Transactions on Circuits and Systems I: Regular Papers, Vol. 66, no. 9, pp. 3480-3493
Main Authors Yang, Chen; Wang, Yizhou; Wang, Xiaoli; Geng, Li
Format Journal Article
Language English
Published New York IEEE 01.09.2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN 1549-8328, 1558-0806
DOI 10.1109/TCSI.2019.2928682

Summary: As convolutional neural networks (CNNs) become more diverse and complex, accelerating them increasingly runs into the difficulty of balancing performance, energy efficiency, and flexibility in a unified architecture. This paper proposes a Winograd-based, highly efficient, and dynamically reconfigurable accelerator (WRA) for quickly evolving CNN models. A cost-effective convolution decomposition method (CDW) is proposed that extends the applicability of the fast Winograd algorithm. Based on CDW, a high-throughput, reconfigurable processing element (PE) array is designed to exploit the parallelism of Winograd. In addition, a highly compact memory structure employs four levels of data reuse to maximize data reuse and minimize the external bandwidth requirement. With its dynamic reconfigurability, WRA implements CDW and other convolutions (e.g., standard convolution, depthwise separable convolution, and group convolution) on a unified hardware architecture. WRA was implemented on a Xilinx XCVU9P platform running at a 330 MHz clock frequency and controlled by a POWER8 processor via a coherent accelerator processor interface (CAPI). Depending on the configuration, WRA provides 2.2 to 6.3 TOPS for different convolution shapes. The average performance and energy efficiency for VGG16, AlexNet, MobileNetV1, and MobileNetV2 are 5288 GOP/s at 151.2 GOPs/W, 3478 GOP/s at 99.4 GOPs/W, 2674 GOP/s at 76.4 GOPs/W, and 2194 GOP/s at 62.7 GOPs/W, respectively. It achieves a 1.7× to 24× speedup compared with previous FPGA-based designs.
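For readers unfamiliar with the fast Winograd algorithm that the summary refers to, the following is a minimal sketch of the textbook 1-D case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. It is not the paper's CDW method or its PE array design; the function names `direct_f23` and `winograd_f23` are hypothetical and chosen only for illustration.

```python
# Textbook Winograd minimal filtering F(2,3): two outputs of a 3-tap
# filter from a 4-element input tile. Illustrative only; this is NOT
# the CDW decomposition described in the paper.

def direct_f23(d, g):
    """Reference result: sliding dot product, 6 multiplications."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return (y0, y1)


def winograd_f23(d, g):
    """Same two outputs using 4 multiplications between data and filter."""
    # Filter transform: depends only on g, so it can be precomputed
    # once per filter and reused for every input tile.
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    # Input transform followed by the 4 element-wise multiplications.
    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    # Output transform: additions and subtractions only.
    return (m1 + m2 + m3, m2 - m3 - m4)


if __name__ == "__main__":
    tile = [1.0, 2.0, 3.0, 4.0]
    taps = [1.0, 0.0, -1.0]
    print(direct_f23(tile, taps), winograd_f23(tile, taps))
```

The saving comes from trading multiplications for additions; in 2-D, nesting the same transforms yields F(2×2, 3×3), which needs 16 multiplications per tile instead of 36. How WRA maps such transforms onto hardware is described in the full paper, not in this record.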