Implementation of energy-efficient fast convolution algorithm for deep convolutional neural networks based on FPGA


Bibliographic Details
Published in: Electronics Letters, Vol. 56, no. 10, pp. 485-488
Main Authors: Li, W.-J.; Ruan, S.-J.; Yang, D.-S.
Format: Journal Article
Language: English
Published: The Institution of Engineering and Technology, 14.05.2020
ISSN: 0013-5194, 1350-911X
DOI: 10.1049/el.2019.4188

Summary: State-of-the-art convolutional neural networks (CNNs) underpin many deep learning models. As models become more accurate, both the amount of computation and the number of data accesses increase significantly. The proposed design uses the row-stationary dataflow with a network-on-chip, together with a fast convolution algorithm in the processing elements, to reduce computation and data accesses simultaneously. An experimental evaluation using the CNN layers of VGG-16 with a batch size of three shows that the proposed design is more energy-efficient than the state-of-the-art work. Compared with prior work, the proposed design improves the total GOPs of the algorithm by 1.497 times and reduces on-chip and off-chip memory accesses by 1.07 and 1.46 times, respectively.
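The letter does not reproduce its kernels here, but "fast convolution algorithms" in CNN accelerators are commonly Winograd-style minimal-filtering schemes, which trade extra additions for fewer multiplications. As a hedged illustration (not the authors' actual implementation), the sketch below shows Winograd F(2,3): two outputs of a 3-tap 1-D convolution computed with 4 multiplications instead of the direct method's 6.

```python
# Hypothetical sketch of a Winograd-style fast convolution, F(2,3):
# 2 outputs of a 3-tap filter with 4 multiplications instead of 6.
# This is an illustration of the general technique, not the letter's design.

def direct_f2_3(d, g):
    """Direct 1-D convolution: inputs d[0..3], filter g[0..2]; 6 multiplies."""
    return [d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
            d[1] * g[0] + d[2] * g[1] + d[3] * g[2]]

def winograd_f2_3(d, g):
    """Winograd F(2,3): the same two outputs with only 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2  # filter-side transforms are
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2  # precomputed once per layer
    m4 = (d[1] - d[3]) * g[2]
    return [m1 + m2 + m3, m2 - m3 - m4]
```

Since the filter transform is computed once per layer and amortised over all input tiles, the per-tile saving approaches the 6/4 = 1.5x multiplication reduction, which is of the same order as the 1.497x GOPs improvement the letter reports.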