FATHOM: Fast Attention Through Optimizing Memory

Bibliographic Details
Published in: Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1166 - 1178
Main Authors: Binder, Elliott; Sudarsanam, Arvind; Sunkavalli, Ravi; Low, Tze Meng
Format: Conference Proceeding
Language: English
Published: IEEE, 03.06.2025
ISSN: 1530-2075
DOI: 10.1109/IPDPS64566.2025.00106

Summary: Transformer models are built on attention and feedforward layers that are predominantly matrix-matrix multiplication. Although matrix multiplication is often thought to be compute-bound, the matrix dimensions in attention are too small to reach peak compute throughput on many of today's CPU and GPU architectures. For these sizes, the matrix multiplication routines in many dense linear algebra libraries also fail to reach the memory bound. To improve the memory bandwidth utilization of transformer models, we employ a bandwidth-friendly data layout for intermediate data between operations and redesign our matrix multiplication kernels to optimize for memory utilization and bandwidth efficiency. We present Fast Attention Through Optimizing Memory (FATHOM), which achieves up to 6.7× higher throughput in batch matrix multiplications and a 1.8× speedup in end-to-end models on CPU and GPU architectures.
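
To make the memory-bound claim concrete, the sketch below works a rough roofline estimate for a single attention score matmul (Q times K-transpose for one head). The sequence length, head dimension, element size, and machine balance are illustrative assumptions for this sketch, not figures from the paper.

```python
# Rough roofline check for one attention score matmul per head:
# Q is (n, d), K is (n, d), output Q @ K^T is (n, n).
# All sizes and the machine balance are illustrative assumptions.

def attention_matmul_intensity(n, d, bytes_per_elem=2):
    """Arithmetic intensity (flop/byte), assuming each operand and the
    result are read from / written to memory exactly once."""
    flops = 2 * n * n * d                            # multiply-accumulate count
    bytes_moved = bytes_per_elem * (2 * n * d + n * n)
    return flops / bytes_moved

if __name__ == "__main__":
    n, d = 1024, 64                                  # assumed sequence length, head dim
    intensity = attention_matmul_intensity(n, d)
    # Assumed machine balance: ~312 TFLOP/s fp16 over ~2 TB/s HBM (A100-like GPU).
    machine_balance = 312e12 / 2e12
    print(f"arithmetic intensity: {intensity:.1f} flop/byte")
    print(f"machine balance:      {machine_balance:.1f} flop/byte")
    print("memory-bound" if intensity < machine_balance else "compute-bound")
```

With these assumed numbers the intensity is roughly 57 flop/byte against a balance of about 156 flop/byte, so the kernel sits well below the compute roof and its runtime is governed by how efficiently memory bandwidth is used, which is the regime FATHOM targets.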