An FPGA-Based High-Throughput Dataflow Accelerator for Lightweight Neural Network
| Published in | IEEE International Symposium on Circuits and Systems proceedings pp. 1 - 5 |
|---|---|
| Main Authors | , , , , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 19.05.2024 |
| Subjects | |
| ISSN | 2158-1525 |
| DOI | 10.1109/ISCAS58744.2024.10558315 |
| Summary: | Lightweight neural networks (LWNNs) have recently drawn significant attention for their compact architectures and acceptable accuracy. Despite substantial reductions in computational complexity and model size, the extensive use of depthwise separable convolutions (DSCs) and skip-connection blocks (SCBs) increases memory access demands, which makes it difficult to achieve the anticipated performance. To process LWNNs efficiently, an FPGA-based dataflow accelerator is proposed in this paper. First, a pixel-based streaming strategy is introduced to reduce off-chip memory access while minimizing on-chip memory overhead. Furthermore, an adaptive-bandwidth computing engine (CE) is designed to increase computational efficiency in a multi-CE architecture. Finally, based on the scalable CE, a dynamic parallelism allocation algorithm is proposed to avoid underutilization of on-chip computing resources. ShuffleNetV2 is implemented on the Xilinx ZC706 platform, and the results show the proposed accelerator achieves a state-of-the-art performance of 1771.2 FPS and a computational efficiency of 0.64 GOPS/DSP, which is 5.3× that of the reference design. |
|---|---|
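The summary's "dynamic parallelism allocation" can be illustrated in spirit with a minimal sketch: distributing a fixed DSP budget across per-layer compute engines in proportion to each layer's operation count. This is not the paper's algorithm; the function name, the largest-remainder rounding, and all parameters are assumptions for illustration only.

```python
def allocate_parallelism(layer_ops, total_dsp):
    """Hypothetical sketch: split a DSP budget across compute engines
    proportionally to each layer's operation count, using
    largest-remainder rounding so the whole budget is used."""
    total_ops = sum(layer_ops)
    # Initial proportional shares (integer floor division).
    shares = [total_dsp * ops // total_ops for ops in layer_ops]
    # Hand leftover DSPs to the layers with the largest remainders,
    # so busier layers get the extra parallelism first.
    leftover = total_dsp - sum(shares)
    by_remainder = sorted(range(len(layer_ops)),
                          key=lambda i: (layer_ops[i] * total_dsp) % total_ops,
                          reverse=True)
    for i in by_remainder[:max(0, leftover)]:
        shares[i] += 1
    return shares

# Example: 90 DSPs over layers with 100/200/700 operations
# yields shares proportional to the workload.
print(allocate_parallelism([100, 200, 700], 90))  # → [9, 18, 63]
```

A static proportional split like this is only a baseline; the paper's point is that allocation is driven by its scalable CE design to keep on-chip resources from sitting idle.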