FSpGEMM: A Framework for Accelerating Sparse General Matrix-Matrix Multiplication Using Gustavson's Algorithm on FPGAs


Bibliographic Details
Published in	IEEE transactions on very large scale integration (VLSI) systems Vol. 32; no. 4; pp. 1 - 0
Main Authors	Tavakoli, Erfan Bank, Riera, Michael, Quraishi, Masudul Hassan, Ren, Fengbo
Format	Journal Article
Language	English
Published	New York IEEE 01.04.2024 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithms; Benchmarks; Computer architecture; Computer memory; Embedded systems; Field programmable gate arrays; Field-programmable gate array (FPGA); Format; general sparse matrix–matrix multiplication (SpGEMM); Graphics processing units; Gustavson’s algorithm; Hardware; Indexes; Machine learning; Mathematical analysis; Matrix algebra; Matrix converters; Memory management; Multiplication; Network latency; OpenCL; Performance enhancement; Preprocessing; reconfigurable computing; Resource utilization; Sparse matrices; Sparsity
ISSN	1063-8210 1557-9999
DOI	10.1109/TVLSI.2024.3355499

Summary:	General sparse matrix-matrix multiplication (SpGEMM) is integral to many high-performance computing (HPC) and machine learning applications. However, prior field-programmable gate array (FPGA)-based SpGEMM accelerators either use the inner product algorithm, with its wasted and costly operations, or Gustavson's algorithm with a cache-based hardware architecture that suffers from long-latency cache miss penalties and is limited to embedded devices. In this work, we propose FSpGEMM, an OpenCL-based framework for accelerating SpGEMM with Gustavson's algorithm. It includes an FPGA kernel implementing a throughput-optimized and scalable hardware architecture compatible with high-bandwidth memory (HBM) or traditional DDR-based memory. In addition, to address the irregular memory access patterns incurred by Gustavson's algorithm, we propose a new buffering scheme tailored to the algorithm, enabled by a new compressed sparse vector (CSV) format for representing sparse matrices, and a row reordering technique applied as a preprocessing step to improve data reuse and, consequently, resource utilization. The proposed framework includes a host program implementing preprocessing functions that reorder input matrices and store them in the proposed CSV format for further use. We implemented FSpGEMM using Intel FPGA SDK for OpenCL and experimented with a benchmark of sparse matrices selected from the SuiteSparse Matrix Collection on a Bittware 520N-MX FPGA board. The results show that the reordering technique improves performance by 20.3% on average compared with the baseline. Finally, FSpGEMM outperforms the state-of-the-art (SOTA) FPGA implementation by an average of 2.23× in terms of execution cycles, using the same benchmark and memory system configuration for a fair comparison.
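
For readers unfamiliar with Gustavson's algorithm, the following is a minimal software sketch of the row-wise formulation, not the FSpGEMM kernel itself. It assumes standard compressed sparse row (CSR) inputs rather than the paper's CSV format, which the abstract does not define; the function name `spgemm_gustavson` and its parameters are illustrative. The inner loop shows where the irregular accesses come from: which rows of B are read depends on the column indices of the nonzeros in A.

```python
def spgemm_gustavson(a_ptr, a_idx, a_val, b_ptr, b_idx, b_val, n_rows):
    """Row-wise (Gustavson) product C = A * B, all matrices in CSR.

    Illustrative sketch only: CSR is assumed here, not the paper's CSV format.
    """
    c_ptr, c_idx, c_val = [0], [], []
    for i in range(n_rows):
        acc = {}  # sparse accumulator for row i of C: column -> partial sum
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Scale row k of B by A[i, k] and merge it into the accumulator.
            # The row index k is data-dependent, hence the irregular reads of B.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        # Emit row i of C with column indices in ascending order.
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val


# A = [[1, 0, 2],      B = [[0, 4],
#      [0, 0, 3]]           [5, 0],
#                           [6, 0]]
c_ptr, c_idx, c_val = spgemm_gustavson(
    [0, 2, 3], [0, 2, 2], [1, 2, 3],
    [0, 1, 2, 3], [1, 0, 0], [4, 5, 6],
    n_rows=2,
)
```

Unlike the inner product formulation, this loop only touches pairs of nonzeros that actually contribute to an output element, so no work is spent on index matches that produce nothing.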