FSpGEMM: A Framework for Accelerating Sparse General Matrix-Matrix Multiplication Using Gustavson's Algorithm on FPGAs

Bibliographic Details
Published in: IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 32, no. 4, pp. 1 - 0
Main Authors: Tavakoli, Erfan Bank; Riera, Michael; Quraishi, Masudul Hassan; Ren, Fengbo
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2024
ISSN: 1063-8210; 1557-9999
DOI: 10.1109/TVLSI.2024.3355499

Summary: General sparse matrix-matrix multiplication (SpGEMM) is integral to many high-performance computing (HPC) and machine learning applications. However, prior field-programmable gate array (FPGA)-based SpGEMM accelerators either use the inner product algorithm, which performs wasted and costly operations, or use Gustavson's algorithm with a cache-based hardware architecture that suffers from long-latency cache miss penalties and is limited to embedded devices. In this work, we propose a framework for accelerating SpGEMM (FSpGEMM), an OpenCL-based SpGEMM framework for accelerating Gustavson's algorithm that includes an FPGA kernel implementing a throughput-optimized and scalable hardware architecture compatible with high-bandwidth memory (HBM) or traditional DDR-based memory. In addition, to address the irregular memory access patterns incurred by Gustavson's algorithm, we propose a new buffering scheme tailored to Gustavson's algorithm, enabled by a new compressed sparse vector (CSV) format for representing sparse matrices and a row reordering technique applied as a preprocessing step to improve data reuse and, consequently, resource utilization. The proposed framework includes a host program implementing preprocessing functions that reorder input matrices and store them in the proposed CSV format for further use. We implemented FSpGEMM using the Intel FPGA SDK for OpenCL and experimented with a benchmark of sparse matrices selected from the SuiteSparse Matrix Collection on a Bittware 520N-MX FPGA board. The results show that the reordering technique improves performance by 20.3% on average compared with the baseline. Finally, FSpGEMM outperforms the state-of-the-art (SOTA) FPGA implementation by an average of 2.23× in terms of execution cycles, using the same benchmark and memory system configuration for a fair comparison.
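For readers unfamiliar with the row-wise formulation that the summary contrasts with the inner product approach, here is a minimal software sketch of Gustavson's algorithm, C[i,:] = Σ_k A[i,k]·B[k,:]. It is a sketch under stated assumptions, not FSpGEMM's method: it assumes the common CSR input format (not the paper's CSV format), uses a dictionary as the sparse accumulator, and does not model the paper's buffering scheme, row reordering, or FPGA kernel. The names `spgemm_gustavson` and `CSR` are hypothetical. Note how each nonzero A[i,k] triggers a read of row k of B whose location depends on the data; this is the irregular access pattern the summary refers to.

```python
from typing import Dict, List, Tuple

# Hypothetical CSR container: (indptr, indices, data, n_cols).
# Assumption: plain CSR, not the CSV format proposed in the paper.
CSR = Tuple[List[int], List[int], List[float], int]


def spgemm_gustavson(a: CSR, b: CSR) -> CSR:
    """Row-wise (Gustavson) SpGEMM: C[i,:] = sum_k A[i,k] * B[k,:]."""
    a_ptr, a_idx, a_val, a_cols = a
    b_ptr, b_idx, b_val, b_cols = b
    if a_cols != len(b_ptr) - 1:
        raise ValueError("inner dimensions do not match")

    c_ptr: List[int] = [0]
    c_idx: List[int] = []
    c_val: List[float] = []
    for i in range(len(a_ptr) - 1):
        acc: Dict[int, float] = {}  # sparse accumulator for row i of C
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a_ik = a_idx[p], a_val[p]
            # Data-dependent fetch of row k of B (irregular access);
            # only products of nonzeros are ever formed.
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0.0) + a_ik * b_val[q]
        # Emit row i of C with column indices in ascending order.
        for j in sorted(acc):
            c_idx.append(j)
            c_val.append(acc[j])
        c_ptr.append(len(c_idx))
    return c_ptr, c_idx, c_val, b_cols


if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0]], B = [[0, 4], [5, 0], [6, 7]]
    A: CSR = ([0, 2, 3], [0, 2, 1], [1.0, 2.0, 3.0], 3)
    B: CSR = ([0, 1, 2, 4], [1, 0, 0, 1], [4.0, 5.0, 6.0, 7.0], 2)
    print(spgemm_gustavson(A, B))  # ([0, 2, 3], [0, 1, 0], [12.0, 18.0, 15.0], 2)
```

Unlike an inner product formulation, which tests every (row of A, column of B) pair for overlapping nonzeros, this loop only visits the rows of B selected by A's nonzeros. That avoids the wasted operations the summary mentions, at the cost of the irregular reads of B.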