FATHOM: Fast Attention Through Optimizing Memory
| Published in | Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1166 - 1178 |
|---|---|
| Main Authors | , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 03.06.2025 |
| ISSN | 1530-2075 | 
| DOI | 10.1109/IPDPS64566.2025.00106 | 
| Summary: | Transformer models are built on attention and feedforward layers that are predominantly matrix-matrix multiplication. Although matrix multiplication is often thought to be compute-bound, the matrix dimensions in attention are too small to reach peak compute throughput on many of today's CPU and GPU architectures. These same routines in many dense linear algebra libraries also do not reach the memory bound for matrix multiplication for these sizes. To improve the memory bandwidth utilization of transformer models, we employ a bandwidth-friendly data layout of intermediate data between operations and redesign our matrix multiplication kernels to optimize for memory utilization and bandwidth efficiency. We present Fast Attention Through Optimizing Memory (FATHOM), which achieves up to 6.7× higher throughput in batch matrix multiplications and 1.8× speedup in end-to-end models on CPU and GPU architectures. |
|---|---|
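A quick way to see why attention-sized matrix multiplications tend to be memory-bound, as the abstract claims, is a roofline-style arithmetic-intensity estimate. The sketch below is not from the paper: the sequence lengths, the head dimension of 64, the fp16 element size, and the peak compute/bandwidth figures are illustrative assumptions, chosen only to be in the rough range of a current datacenter GPU.

```python
# Back-of-the-envelope roofline check (illustrative, not from the paper):
# arithmetic intensity of the Q @ K^T batched matmul in one attention head,
# compared against an assumed machine-balance point.

def qkt_arithmetic_intensity(seq_len: int, head_dim: int, bytes_per_elem: int = 2) -> float:
    """FLOPs per byte moved for a (seq_len x head_dim) @ (head_dim x seq_len) product."""
    flops = 2 * seq_len * seq_len * head_dim          # one multiply-add per output element per k
    bytes_moved = bytes_per_elem * (
        2 * seq_len * head_dim                        # read Q and K once
        + seq_len * seq_len                           # write the score matrix
    )
    return flops / bytes_moved

# Assumed peaks, roughly the order of magnitude of a modern datacenter GPU.
PEAK_FLOPS = 300e12       # ~300 TFLOP/s half precision (assumption)
PEAK_BANDWIDTH = 2e12     # ~2 TB/s HBM bandwidth (assumption)
MACHINE_BALANCE = PEAK_FLOPS / PEAK_BANDWIDTH  # FLOP/byte needed to be compute-bound

for seq_len in (128, 512, 2048):
    ai = qkt_arithmetic_intensity(seq_len, head_dim=64)
    regime = "compute-bound" if ai > MACHINE_BALANCE else "memory-bound"
    print(f"seq_len={seq_len:5d}: {ai:6.1f} FLOP/byte vs balance {MACHINE_BALANCE:.0f} -> {regime}")
```

Under these assumptions the Q·Kᵀ product stays well below the machine-balance point even at longer sequence lengths, which is consistent with the abstract's observation that these matrix shapes cannot reach peak compute throughput and are better optimized for memory bandwidth.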