SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From Scratch

Bibliographic Details
Published in	IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 44, no. 2, pp. 497-511
Main Authors	Tang, Minjin; Wen, Mei; Yang, Jianchao; Xue, Zeyu; Shen, Junzhong
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.02.2025
Subjects	Algorithms; Arrays; Computation; Computational efficiency; Hardware; Integrated circuits; Matrix algebra; Performance evaluation; Software; Sparse computation; Sparse matrices; Sparsity; Systolic arrays
ISSN	0278-0070, 1937-4151
DOI	10.1109/TCAD.2024.3434359

More Information
Summary:	Sparse matrix-matrix multiplication (SpMM) and generalized SpMM (SpGEMM) are essential computational kernels in domains such as graph analytics and scientific computation. While systolic arrays have traditionally been employed as specialized architectures for complex computing problems like matrix multiplication, they are inefficient when dealing with sparse matrices. This inefficiency arises from the unnecessary operations performed by processing elements (PEs) that hold zero-valued entries, which do not contribute to the final result. To address this issue, we propose SPSA, a framework that leverages a sparse-packing algorithm suitable for systolic arrays to accelerate sparse matrix computations. Our approach significantly reduces the number of zero-valued items and improves matrix density by packing the rows or columns of the sparse matrix. Furthermore, we introduce, for the first time, a data representation format tailored to systolic arrays, called CSXD, which further enhances storage and computational efficiency. Importantly, our adaptation scheme enables acceleration benefits even with limited resources. Through sparse packing, SPSA achieves a 5.2× performance improvement over the dense baseline, rising to 6.4× with CSXD. CSXD also improves storage efficiency by 15.0× on average. In extensive evaluations, SPSA outperforms previous designs on CPU, GPU, and ASIC platforms. Finally, in end-to-end evaluations, SPSA achieves a 3.9× performance improvement across the BERT, VGG19, and ResNet50 workloads.
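
The summary states that density improves when rows or columns of the sparse matrix are packed together, but this record does not give the algorithm itself. The following is only a minimal illustrative sketch of the general idea, assuming a simple greedy rule: a row joins an existing group if none of its nonzero columns are already occupied there. The names `pack_rows` and `density`, the greedy strategy, and the `origin` bookkeeping are all hypothetical and are not taken from SPSA or its CSXD format.

```python
import numpy as np

def density(m):
    """Fraction of entries that are nonzero."""
    m = np.asarray(m)
    return np.count_nonzero(m) / m.size

def pack_rows(matrix):
    """Greedily merge rows whose nonzero columns do not collide.

    Assumed illustrative scheme, not the published SPSA algorithm.
    Returns (packed, origin): packed[g, c] is the value stored in
    column c of group g, and origin[g, c] is the index of the source
    row it came from (-1 where the slot is empty).
    """
    matrix = np.asarray(matrix)
    n_cols = matrix.shape[1]
    packed_rows, origin_rows, occupied = [], [], []
    for r, row in enumerate(matrix):
        mask = row != 0
        if not mask.any():
            continue  # all-zero rows carry no work and are dropped
        # First-fit search: take the first group with no column collision.
        for g in range(len(packed_rows)):
            if not (occupied[g] & mask).any():
                break
        else:
            # No compatible group exists, so open a new one.
            packed_rows.append(np.zeros(n_cols, dtype=matrix.dtype))
            origin_rows.append(np.full(n_cols, -1, dtype=int))
            occupied.append(np.zeros(n_cols, dtype=bool))
            g = len(packed_rows) - 1
        packed_rows[g][mask] = row[mask]
        origin_rows[g][mask] = r
        occupied[g] |= mask
    packed = np.array(packed_rows).reshape(-1, n_cols)
    origin = np.array(origin_rows, dtype=int).reshape(-1, n_cols)
    return packed, origin

# Small made-up example: rows 0, 1, and 2 share no nonzero column,
# so they fit in one group; row 3 collides with row 0 in column 0.
A = np.array([
    [1, 0, 0, 2],
    [0, 3, 0, 0],
    [0, 0, 4, 0],
    [5, 0, 0, 0],
])
packed, origin = pack_rows(A)
print(packed)
print(origin)
print(density(A), density(packed))
```

In this toy case four rows collapse to two and the density doubles from 0.3125 to 0.625. The `origin` array is what lets each partial product be routed back to the correct output row, so packing of this kind trades a small amount of index metadata for fewer wasted PE cycles.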