Darkside: A Heterogeneous RISC-V Compute Cluster for Extreme-Edge On-Chip DNN Inference and Training

Bibliographic Details
Published in	IEEE Open Journal of Solid-State Circuits, Vol. 2, p. 1
Main Authors	Garofalo, Angelo; Tortorella, Yvan; Perotti, Matteo; Valente, Luca; Nadalini, Alessandro; Benini, Luca; Rossi, Davide; Conti, Francesco
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Subjects	Acceleration; Accelerators; Arithmetic; Artificial neural networks; Clusters; Efficiency; Engines; Flexibility; Floating point arithmetic; Hardware; Heterogeneous cluster; Human computer interaction; Inference; Integers; Kernel; Kernels; Multiplication; System on chip; Tensor product engine (TPE); Tensors; Training; Ultra-low-power AI
ISSN	2644-1349
DOI	10.1109/OJSSCS.2022.3210082

Summary:	On-chip DNN inference and training at the Extreme-Edge (TinyML) impose strict latency, throughput, accuracy, and flexibility requirements. Heterogeneous clusters are a promising way to meet this challenge, combining the flexibility of DSP-enhanced cores with the performance and energy boost of dedicated accelerators. We present Darkside, a System-on-Chip with a heterogeneous cluster of 8 RISC-V cores enhanced with 2-b to 32-b mixed-precision integer arithmetic. To boost performance and efficiency on key compute-intensive Deep Neural Network (DNN) kernels, the cluster is enriched with three digital accelerators: a specialized engine for low-data-reuse depthwise convolution kernels (up to 30 MAC/cycle); a minimal-overhead datamover to marshal 1-b to 32-b data on the fly; and a 16-b floating-point Tensor Product Engine (TPE) for tiled matrix-multiplication acceleration. Darkside is implemented in 65 nm CMOS technology. The cluster achieves a peak integer performance of 65 GOPS and a peak efficiency of 835 GOPS/W when working on 2-b integer DNN kernels. When targeting floating-point tensor operations, the TPE provides up to 18.2 GFLOPS of performance or 300 GFLOPS/W of efficiency, enough to enable on-chip floating-point training at competitive speed coupled with ultra-low-power quantized inference.
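
The summary describes depthwise convolution as a low-data-reuse kernel. The following is a minimal sketch, not taken from the paper, of why that holds: it counts multiply-accumulate operations (MACs) per input activation for a standard and a depthwise convolution layer. The function name `conv_macs_and_reuse` and the layer shape used in the example are illustrative assumptions, and the count assumes stride 1 with "same" padding (padded positions are counted as MACs).

```python
def conv_macs_and_reuse(h, w, k, c_in, c_out, depthwise=False):
    """Return (total MACs, MACs per input activation) for one conv layer.

    Assumes stride 1 and 'same' padding, so the output is h x w.
    For a depthwise layer, c_out is ignored (output channels == c_in).
    """
    if depthwise:
        # One k x k filter per channel; channels are never combined.
        macs = h * w * k * k * c_in
    else:
        # Every output channel sums over all input channels.
        macs = h * w * k * k * c_in * c_out
    inputs = h * w * c_in
    return macs, macs // inputs


if __name__ == "__main__":
    # Hypothetical layer: 16 x 16 feature map, 3 x 3 kernel, 32 channels.
    print("standard :", conv_macs_and_reuse(16, 16, 3, 32, 32))
    print("depthwise:", conv_macs_and_reuse(16, 16, 3, 32, 32, depthwise=True))
```

In this count, each input activation of a standard layer feeds k·k·c_out MACs, while in a depthwise layer it feeds only k·k, so far less arithmetic is available to amortize each memory access. This is one plausible reading of why such kernels benefit from a dedicated engine.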