SPSA: Exploring Sparse-Packing Computation on Systolic Arrays From Scratch
| Published in | IEEE transactions on computer-aided design of integrated circuits and systems Vol. 44; no. 2; pp. 497 - 511 |
|---|---|
| Main Authors | , , , , |
| Format | Journal Article |
| Language | English |
| Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.02.2025 |
| Subjects | |
| ISSN | 0278-0070 1937-4151 |
| DOI | 10.1109/TCAD.2024.3434359 |
| Summary: | Sparse matrix-matrix multiplication (SpMM) and generalized SpMM (SpGEMM) are essential computational kernels in domains such as graph analytics and scientific computation. While systolic arrays have traditionally been employed as specialized architectures for demanding computations like matrix multiplication, they are inefficient on sparse matrices: processing elements (PEs) that hold zero-valued entries perform operations that contribute nothing to the final result. To address this issue, we propose SPSA, a framework that leverages a sparse-packing algorithm suited to systolic arrays to accelerate sparse matrix computations. Our approach substantially reduces the number of zero-valued entries and improves matrix density by packing the rows or columns of the sparse matrix. Furthermore, we introduce, for the first time, a data representation format tailored to systolic arrays, called CSXD, which further enhances storage and computational efficiency. Importantly, our adaptation scheme enables acceleration benefits even with limited resources. Through sparse packing, SPSA achieved a 5.2× performance improvement over the dense baseline, rising to 6.4× with CSXD, while CSXD improved storage efficiency by 15.0× on average. Extensive evaluations show that SPSA outperforms previous designs on CPU, GPU, and ASIC platforms. Finally, in end-to-end evaluations, SPSA achieved a 3.9× performance improvement across the BERT, VGG19, and ResNet50 workloads. |
|---|---|
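To make the packing idea from the abstract concrete, the sketch below greedily merges sparse rows whose nonzero column positions do not collide, so fewer (denser) rows need to be streamed through a systolic array. This is only a minimal illustration of the general row-packing concept under assumed data layouts; it is not the authors' SPSA algorithm or CSXD format, and all function and variable names here are hypothetical.

```python
# Illustrative sketch of sparse row-packing: rows whose nonzero column
# positions are disjoint are merged into one "packed" row, increasing the
# density of the matrix mapped onto a systolic array.
# NOTE: hypothetical example, not the SPSA algorithm or CSXD format.

from typing import Dict, List, Tuple

# A sparse row is represented as a dict: column index -> nonzero value.
SparseRow = Dict[int, float]


def pack_rows(rows: List[SparseRow]) -> List[List[Tuple[int, SparseRow]]]:
    """Greedy first-fit packing: each bin holds original rows whose
    nonzero column sets are mutually disjoint."""
    bins: List[List[Tuple[int, SparseRow]]] = []
    occupied: List[set] = []  # column indices already used in each bin

    # Packing denser rows first tends to leave fewer fragmented bins.
    order = sorted(range(len(rows)), key=lambda i: -len(rows[i]))
    for i in order:
        cols = set(rows[i].keys())
        for b, used in enumerate(occupied):
            if used.isdisjoint(cols):      # no column collision -> merge
                bins[b].append((i, rows[i]))
                used |= cols
                break
        else:                              # no existing bin fits -> open one
            bins.append([(i, rows[i])])
            occupied.append(set(cols))
    return bins


if __name__ == "__main__":
    # Four sparse rows of an 8-column matrix; rows with disjoint column
    # patterns end up sharing a packed row.
    rows = [
        {0: 1.0, 3: 2.0},
        {0: 4.0, 5: 1.5},
        {1: 3.0, 6: 0.5},
        {3: 2.5, 5: 1.0, 7: 4.0},
    ]
    packed = pack_rows(rows)
    print(f"{len(rows)} original rows packed into {len(packed)} rows")
    for b, group in enumerate(packed):
        print("packed row", b, "<- original rows", [i for i, _ in group])
```

In this toy run, four sparse rows pack into three denser rows; the reduction grows with sparsity, which is the effect the paper exploits to avoid wasted PE work.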