A Low-Power Reconfigurable DNN Accelerator for Instruction-Extended RISC-V

Bibliographic Details
Published in: IPSJ Transactions on System and LSI Design Methodology, Vol. 17, pp. 55-66
Main Authors: Li, Dongju; Wang, Hansen; Isshiki, Tsuyoshi
Format: Journal Article
Language: English
Published: Tokyo: Information Processing Society of Japan; Japan Science and Technology Agency, 01.01.2024
ISSN: 1882-6687
DOI: 10.2197/ipsjtsldm.17.55

Summary: Deep neural networks (DNNs) are used extensively across diverse domains, including speech recognition, face detection, and image classification. The conventional approach implements DNNs on graphics processing units (GPUs), which prioritize speed at the expense of energy efficiency. To reduce power consumption and improve efficiency, we advocate application-specific hardware. This paper introduces a run-time reconfigurable DNN accelerator SoC (DNN-AS) architecture integrated into an instruction-extended RISC-V platform. An application-specific extension instruction set is designed to accelerate frequently executed DNN operations. To optimize the circuit structure, the DNN-AS uses an 8-bit dynamic fixed-point (DFP) number scheme. We compare the accuracy of DFP against the PyTorch floating-point implementation: DNN-AS shows minimal accuracy loss, with Top-1 accuracy deviations of at most 0.53%, 0.31%, and 0.68% for ResNet-34, ResNet-50, and ResNet-101, respectively. Finally, we compare the overall simulated results with other platforms; our design achieves 8.4× to 1897× higher throughput per joule (GOP/J) than field-programmable gate array (FPGA) and GPU implementations.
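The summary does not say how the DNN-AS selects its 8-bit DFP formats, so the following Python sketch only illustrates the general idea of 8-bit dynamic fixed point under stated assumptions: each tensor gets its own fractional length chosen from its largest magnitude, values are rounded half-to-even and saturated to signed 8-bit codes, and a dot product is computed with integer multiply-accumulates followed by a shift to the output format. All function names, the per-tensor granularity, the floor-style shift, and the example numbers are hypothetical and are not taken from the paper.

```python
import math
from typing import List

WORD_BITS = 8  # total bits per value, including the sign bit (assumed 8-bit DFP)


def choose_frac_bits(values: List[float], word_bits: int = WORD_BITS) -> int:
    """Hypothetical rule: pick a per-tensor fractional length so max |v| fits."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return word_bits - 1  # all-zero tensor: any exponent works
    # Bits needed left of the binary point; may be negative for small tensors,
    # which frees extra fractional bits and improves precision.
    int_bits = math.floor(math.log2(max_abs)) + 1
    return word_bits - 1 - int_bits


def quantize(values: List[float], frac_bits: int, word_bits: int = WORD_BITS) -> List[int]:
    """Map floats to signed integer codes q such that v is approximately q * 2**-frac_bits."""
    qmin, qmax = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    scale = 2.0 ** frac_bits
    # Python's round() is round-half-to-even; results are saturated to [qmin, qmax].
    return [min(qmax, max(qmin, round(v * scale))) for v in values]


def dequantize(codes: List[int], frac_bits: int) -> List[float]:
    """Convert integer codes back to floats for comparison with a float reference."""
    return [q * 2.0 ** -frac_bits for q in codes]


def dfp_dot(a: List[int], a_frac: int, b: List[int], b_frac: int,
            out_frac: int, word_bits: int = WORD_BITS) -> int:
    """Integer-only dot product: 8-bit x 8-bit products summed in a wide
    accumulator, rescaled to the output fractional length, then saturated."""
    acc = sum(x * y for x, y in zip(a, b))  # acc carries a_frac + b_frac fractional bits
    shift = a_frac + b_frac - out_frac
    acc = acc >> shift if shift >= 0 else acc << -shift  # >> rounds toward -infinity
    qmin, qmax = -(1 << (word_bits - 1)), (1 << (word_bits - 1)) - 1
    return min(qmax, max(qmin, acc))


# Usage with hypothetical data.
weights = [0.75, -0.5, 0.125, -0.9]  # hypothetical layer weights
acts = [1.5, 2.0, -0.25, 0.5]        # hypothetical input activations
wf, af = choose_frac_bits(weights), choose_frac_bits(acts)  # 7 and 5
wq, aq = quantize(weights, wf), quantize(acts, af)
out = dfp_dot(wq, wf, aq, af, out_frac=7)                   # -46
approx = dequantize([out], 7)[0]                            # -0.359375
exact = sum(w * x for w, x in zip(weights, acts))           # float reference
print(wf, af, out, approx, exact)
```

Because every tensor shares one exponent, the multipliers only ever see plain 8-bit integers and format changes reduce to shifts, which is one plausible reason a DFP scheme keeps the datapath simple; the gap between `approx` and `exact` here is a small instance of the kind of DFP-versus-float comparison the summary reports.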