A fully connected layer elimination for a binarized convolutional neural network on an FPGA

Bibliographic Details
Published in: International Conference on Field-programmable Logic and Applications, pp. 1 - 4
Main Authors: Nakahara, Hiroki; Fujii, Tomoya; Sato, Shimpei
Format: Conference Proceeding
Language: English, Japanese
Published: Ghent University, 01.09.2017
ISSN: 1946-1488
DOI: 10.23919/FPL.2017.8056771

Summary: A pre-trained convolutional deep neural network (CNN) is widely used in embedded systems, which require high power and area efficiency. In that setting, the CPU is too slow, the embedded GPU dissipates too much power, and the ASIC cannot keep up with the rapid progress of CNN variations. This paper uses a binarized CNN, which restricts both the inputs and the weights to binary values. Since each multiplier is replaced by an XNOR circuit, a high-performance MAC circuit can be realized from many XNOR circuits. In this paper, we eliminate the internal FC layers excluding the last one and insert a binarized average pooling layer, which can be realized by a majority circuit over binarized (1/0) values. In that case, since the weight memory is replaced by a 1's counter, we can realize a CNN that is more compact and faster than conventional ones. We implemented the VGG-11 benchmark CNN for the CIFAR-10 image classification task on the Xilinx Inc. Zedboard. Compared with conventional binarized implementations on an FPGA, the classification accuracy was almost the same, while performance per power was 5.1 times better, performance per area was 8.0 times better, and performance per memory was 8.2 times better.
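
The two hardware tricks the summary describes can be sketched in software: a binarized MAC reduces to XNOR plus a population count, and binarized (1/0) average pooling reduces to a majority vote, i.e. a 1's counter compared against half the window size. The Python sketch below is an illustrative assumption under the usual bit encoding (1 -> +1, 0 -> -1), not the authors' FPGA design; the names xnor_popcount_mac and binarized_average_pool are hypothetical.

def xnor_popcount_mac(x_bits, w_bits, nbits):
    # Binarized dot product over {-1, +1}: each product is +1 exactly when the
    # stored bits agree, so multiplication becomes XNOR and the accumulation
    # becomes a population count: sum = 2 * popcount(xnor) - nbits.
    xnor = ~(x_bits ^ w_bits) & ((1 << nbits) - 1)
    return 2 * bin(xnor).count("1") - nbits

def binarized_average_pool(window_bits):
    # Binarized (1/0) average pooling: averaging then thresholding at 0.5 is a
    # majority vote, i.e. a 1's counter compared against half the window size.
    return 1 if 2 * sum(window_bits) > len(window_bits) else 0

# Example with 8 binarized inputs and weights packed as bits (1 -> +1, 0 -> -1).
x = 0b10110010
w = 0b10010110
print(xnor_popcount_mac(x, w, 8))            # prints 4, the signed dot product
print(binarized_average_pool([1, 0, 1, 1]))  # prints 1, the window's majority

This is why the paper can drop the internal FC layers' weight memories: the majority circuit needs only a counter and a comparator, no stored weights.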