Libra: A Hybrid-Sparse Attention Accelerator Featuring Multi-Level Workload Balance

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Sun, Faxian; Zhang, Runzhou; Liu, Zhenyu; Liao, Heng; Qin, Zhinan; Chen, Jianli; Yu, Jun; Wang, Kun
Format: Conference Proceeding
Language: English
Published: IEEE, 22.06.2025
DOI: 10.1109/DAC63849.2025.11133063

Summary: Transformers have delivered exceptional performance and are widely used across various natural language processing (NLP) tasks, owing to their powerful attention mechanism. However, their high computational complexity and substantial memory usage pose significant challenges to inference efficiency. Numerous quantization and value-level sparsification methods have been proposed to overcome these challenges. Since higher sparsity yields greater acceleration, exploiting both value-level and bit-level sparsity (hybrid sparsity) can more fully unlock the acceleration potential of the attention mechanism. However, increased sparsity exacerbates load imbalance across compute units, potentially limiting the acceleration benefits. To fully exploit the potential of hybrid sparsity, we propose Libra, an attention accelerator developed through algorithm-hardware co-design. At the algorithm level, we design a bit-group-based algorithm consisting of filtered bit-group sparsification (FBS) and dynamic bit-group quantization (DBQ) to maximize the utilization of sparsity in attention: FBS imposes structured sparsity on weights, while DBQ introduces dynamic sparsification during the computation of activations. At the hardware level, we design a task pool to achieve multi-level workload balance, effectively mitigating the load imbalance among compute units induced by hybrid sparsity. Additionally, the different stages of DBQ can be executed in parallel, with each stage operating at a distinct bit-width; to support this, we design an adaptive bit-width architecture that enables simultaneous computation at varying bit-widths. Our experiments demonstrate that, compared to state-of-the-art (SOTA) attention accelerators, Libra achieves a 1.49×–5.89× speedup and a 2.65×–10.82× improvement in energy efficiency.
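
The summary describes hybrid sparsity and the load imbalance it induces only at a high level. The following minimal Python sketch is not the paper's FBS, DBQ, or task-pool design; it is an assumed illustration of how value-level sparsity (pruning small attention scores) and bit-level sparsity (skipping zero bits of the surviving quantized values) compound in a bit-serial dot product. The quantization scheme, the 0.5 pruning threshold, and all function names are illustrative assumptions, not the paper's algorithms.

```python
# Illustrative sketch only (not Libra's FBS/DBQ): value-level sparsity skips
# whole multiplies, bit-level sparsity skips zero bits of the kept values.
import numpy as np

def quantize_int8(x):
    """Symmetric 8-bit quantization of a float array (assumed scheme)."""
    scale = np.max(np.abs(x)) / 127.0
    return np.round(x / scale).astype(np.int32), scale

def hybrid_sparse_dot(scores_q, values_q, keep_mask):
    """Bit-serial dot product that skips pruned scores (value-level sparsity)
    and zero bits of each kept score (bit-level sparsity).
    Returns the result and the number of shift-add operations performed."""
    acc, ops = 0, 0
    for s, v, keep in zip(scores_q, values_q, keep_mask):
        if not keep:              # value-level sparsity: whole multiply skipped
            continue
        sign = -1 if s < 0 else 1
        mag, bit = abs(int(s)), 0
        while mag:                # bit-level sparsity: only nonzero bits cost work
            if mag & 1:
                acc += sign * (int(v) << bit)
                ops += 1
            mag >>= 1
            bit += 1
    return acc, ops

rng = np.random.default_rng(0)
scores = rng.standard_normal(16)
values = rng.standard_normal(16)
scores_q, s_scale = quantize_int8(scores)
values_q, v_scale = quantize_int8(values)

# Value-level sparsity: keep only scores above an (assumed) magnitude threshold.
keep = np.abs(scores) > 0.5
result, ops = hybrid_sparse_dot(scores_q, values_q, keep)
dense_ops = 8 * len(scores)       # a dense bit-serial unit processes all 8 bits
print(f"approx dot = {result * s_scale * v_scale:.3f}  (exact {np.dot(scores, values):.3f})")
print(f"shift-adds: {ops} vs {dense_ops} dense")
```

Because the number of surviving scores and of nonzero bits is data dependent, compute units assigned different rows finish at different times; that variability is the load-imbalance problem the abstract says Libra's multi-level workload balancing (the task pool) is designed to mitigate.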