Libra: A Hybrid-Sparse Attention Accelerator Featuring Multi-Level Workload Balance

Bibliographic Details
Published in: 2025 62nd ACM/IEEE Design Automation Conference (DAC), pp. 1-7
Main Authors: Sun, Faxian; Zhang, Runzhou; Liu, Zhenyu; Liao, Heng; Qin, Zhinan; Chen, Jianli; Yu, Jun; Wang, Kun
Format: Conference Proceeding
Language: English
Published: IEEE, 22.06.2025
DOI: 10.1109/DAC63849.2025.11133063

Summary: Transformers have delivered exceptional performance and are widely used across various natural language processing (NLP) tasks, owing to their powerful attention mechanism. However, their high computational complexity and substantial memory usage pose significant challenges to inference efficiency. Numerous quantization and value-level sparsification methods have been proposed to overcome these challenges. Since higher sparsity yields greater acceleration, exploiting both value-level and bit-level sparsity (hybrid sparsity) can more fully unlock the acceleration potential of the attention mechanism. However, increased sparsity exacerbates load imbalance across compute units, potentially limiting the acceleration benefits. To fully exploit the potential of hybrid sparsity, we propose Libra, an attention accelerator developed through algorithm-hardware co-design. At the algorithm level, we design a bit-group-based algorithm consisting of filtered bit-group sparsification (FBS) and dynamic bit-group quantization (DBQ) to maximize the utilization of sparsity in attention: FBS imposes structured sparsity on weights, while DBQ introduces dynamic sparsification during the computation of activations. At the hardware level, we design a task pool to achieve multi-level workload balance, effectively mitigating the load imbalance among compute units induced by hybrid sparsity. Additionally, the different stages of DBQ can be executed in parallel, with each stage operating at a distinct bit-width; to support this, we design an adaptive bit-width architecture that enables simultaneous computation at varying bit-widths. Our experiments demonstrate that, compared to state-of-the-art (SOTA) attention accelerators, Libra achieves a 1.49×–5.89× speedup and a 2.65×–10.82× improvement in energy efficiency.
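
The summary describes hybrid sparsity and the load imbalance it induces only at a high level. The following minimal Python sketch is not the paper's FBS, DBQ, or task-pool design; it is an assumed illustration of how value-level sparsity (pruning small attention scores) and bit-level sparsity (skipping zero bits of the surviving quantized values) compound in a bit-serial dot product. The quantization scheme, the 0.5 pruning threshold, and all function names are illustrative assumptions, not the paper's algorithms.

```python
# Illustrative sketch only (not Libra's FBS/DBQ): value-level sparsity skips
# whole multiplies, bit-level sparsity skips zero bits of the kept values.
import numpy as np

def quantize_int8(x):
    """Symmetric 8-bit quantization of a float array (assumed scheme)."""
    scale = np.max(np.abs(x)) / 127.0
    return np.round(x / scale).astype(np.int32), scale

def hybrid_sparse_dot(scores_q, values_q, keep_mask):
    """Bit-serial dot product that skips pruned scores (value-level sparsity)
    and zero bits of each kept score (bit-level sparsity).
    Returns the result and the number of shift-add operations performed."""
    acc, ops = 0, 0
    for s, v, keep in zip(scores_q, values_q, keep_mask):
        if not keep:              # value-level sparsity: whole multiply skipped
            continue
        sign = -1 if s < 0 else 1
        mag, bit = abs(int(s)), 0
        while mag:                # bit-level sparsity: only nonzero bits cost work
            if mag & 1:
                acc += sign * (int(v) << bit)
                ops += 1
            mag >>= 1
            bit += 1
    return acc, ops

rng = np.random.default_rng(0)
scores = rng.standard_normal(16)
values = rng.standard_normal(16)
scores_q, s_scale = quantize_int8(scores)
values_q, v_scale = quantize_int8(values)

# Value-level sparsity: keep only scores above an (assumed) magnitude threshold.
keep = np.abs(scores) > 0.5
result, ops = hybrid_sparse_dot(scores_q, values_q, keep)
dense_ops = 8 * len(scores)       # a dense bit-serial unit processes all 8 bits
print(f"approx dot = {result * s_scale * v_scale:.3f}  (exact {np.dot(scores, values):.3f})")
print(f"shift-adds: {ops} vs {dense_ops} dense")
```

Because the number of surviving scores and of nonzero bits is data dependent, compute units assigned different rows finish at different times; that variability is the load-imbalance problem the abstract says Libra's multi-level workload balancing (the task pool) is designed to mitigate.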