SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From Scratch

Sparse matrix-matrix multiplication (SpMM) and Generalized SpMM (SpGEMM) are essential computational kernels in domains, such as graph analytics and scientific computation. While systolic arrays have traditionally been employed as specialized architectures for complex computing problems like matrix...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on computer-aided design of integrated circuits and systems Vol. 44; no. 2; pp. 497 - 511
Main Authors Tang, Minjin, Wen, Mei, Yang, Jianchao, Xue, Zeyu, Shen, Junzhong
Format Journal Article
LanguageEnglish
Published New York IEEE 01.02.2025
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online AccessGet full text
ISSN0278-0070
1937-4151
DOI10.1109/TCAD.2024.3434359

Cover

More Information
Summary:Sparse matrix-matrix multiplication (SpMM) and Generalized SpMM (SpGEMM) are essential computational kernels in domains, such as graph analytics and scientific computation. While systolic arrays have traditionally been employed as specialized architectures for complex computing problems like matrix multiplication, they exhibit inefficiency when dealing with sparse matrices. This inefficiency arises from the unnecessary operations performed by processing elements (PEs) that contain zero-valued entries, which do not contribute to the final result. To address this issue, we propose SPSA, a framework that leverages a sparse-packing algorithm suitable for systolic arrays to accelerate sparse matrix computations. Our approach achieves significant reduction of zero-valued items and improves matrix density by packing the rows or columns of the sparse matrix. Furthermore, we have introduced for the first time a data representation format tailored to systolic arrays, called CSXD, which further enhances storage and computational efficiency. Importantly, our adaptation scheme enables acceleration benefits even with limited resources. Through sparse packing, SPSA achieved a <inline-formula> <tex-math notation="LaTeX">5.2\times </tex-math></inline-formula> performance improvement compared to the dense baseline, and further reached a <inline-formula> <tex-math notation="LaTeX">6.4\times </tex-math></inline-formula> enhancement via CSXD. Simultaneously, CSXD realized an average storage efficiency improvement of <inline-formula> <tex-math notation="LaTeX">15.0\times </tex-math></inline-formula>. Through extensive evaluations, SPSA outperforms previous designs on CPU, GPU, and ASIC platforms. Finally, in end-to-end evaluations, SPSA achieved a performance improvement of 3.9 times across the workloads of BERT, VGG19, and ResNet50.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:0278-0070
1937-4151
DOI:10.1109/TCAD.2024.3434359