FATHOM: Fast Attention Through Optimizing Memory
| Published in | Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1166 - 1178 |
|---|---|
| Main Authors | , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 03.06.2025 |
| ISSN | 1530-2075 | 
| DOI | 10.1109/IPDPS64566.2025.00106 | 
| Summary: | Transformer models are built on attention and feedforward layers that are predominantly matrix-matrix multiplication. Although matrix multiplication is often thought to be compute-bound, the matrix dimensions in attention are too small to reach peak compute throughput on many of today's CPU and GPU architectures. These same routines in many dense linear algebra libraries also do not reach the memory bound for matrix multiplication for these sizes. To improve the memory bandwidth utilization of transformer models, we employ a bandwidth-friendly data layout of intermediate data between operations and redesign our matrix multiplication kernels to optimize for memory utilization and bandwidth efficiency. We present Fast Attention Through Optimizing Memory (FATHOM), which achieves up to 6.7× higher throughput in batch matrix multiplications and 1.8× speedup in end-to-end models on CPU and GPU architectures. |
|---|---|
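A quick way to see why attention-sized matrix multiplications tend to be memory-bound, as the abstract claims, is a roofline-style arithmetic-intensity estimate. The sketch below is not from the paper: the sequence lengths, the head dimension of 64, the fp16 element size, and the peak compute/bandwidth figures are illustrative assumptions, chosen only to be in the rough range of a current datacenter GPU.

```python
# Back-of-the-envelope roofline check (illustrative, not from the paper):
# arithmetic intensity of the Q @ K^T batched matmul in one attention head,
# compared against an assumed machine-balance point.

def qkt_arithmetic_intensity(seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (seq_len x head_dim) @ (head_dim x seq_len) product."""
    flops = 2 * seq_len * seq_len * head_dim          # one multiply-add per output element per k
    bytes_moved = bytes_per_elem * (
        2 * seq_len * head_dim                        # read Q and K once
        + seq_len * seq_len                           # write the score matrix
    )
    return flops / bytes_moved

# Assumed peaks, roughly the order of magnitude of a modern datacenter GPU.
PEAK_FLOPS = 300e12       # ~300 TFLOP/s half precision (assumption)
PEAK_BANDWIDTH = 2e12     # ~2 TB/s HBM bandwidth (assumption)
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP/byte needed to be compute-bound

for seq_len in (128, 512, 2048):
    ai = qkt_arithmetic_intensity(seq_len, head_dim=64)
    regime = "compute-bound" if ai > MACHINE_BALANCE else "memory-bound"
    print(f"seq_len={seq_len:5d}: {ai:6.1f} FLOP/byte vs balance {MACHINE_BALANCE:.0f} -> {regime}")
```

Under these assumptions the Q·Kᵀ product stays well below the machine-balance point even at longer sequence lengths, which is consistent with the abstract's observation that these matrix shapes cannot reach peak compute throughput and are better optimized for memory bandwidth.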