SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
| Published in | Proceedings / IEEE International Conference on Cluster Computing pp. 1 - 11 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.09.2025 |
| ISSN | 2168-9253 |
| DOI | 10.1109/CLUSTER59342.2025.11186491 |
| Summary: | Modern large language model (LLM) serving systems address distributed deployment challenges through two key techniques: distributed model partitioning for parallel computation across accelerators and quantization for reducing parameter size. While existing systems assume homogeneous GPU environments, we reveal significant untapped potential in heterogeneous systems with mixed-capacity accelerators, where two critical limitations persist: (1) uniform partitioning and quantization strategies fail to adapt to hardware heterogeneity, exacerbating resource imbalance, and (2) decoupled optimization of partitioning and quantization overlooks critical performance synergies between these techniques. We present SplitQuant, a phase-aware distributed serving system that co-optimizes mixed-precision quantization, phase-aware model partitioning, and micro-batch sizing for heterogeneous environments. Our approach combines analytical modeling of quality-runtime tradeoffs with a lightweight planning algorithm to maximize throughput while preserving user-specified model quality targets. Evaluations across 10 production clusters show SplitQuant achieves up to 2.34× (1.61× mean) higher throughput than state-of-the-art approaches without violating accuracy targets. Our results underscore the value of co-designing quantization and model partitioning strategies for heterogeneous environments. |
|---|---|