SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
| Published in | Proceedings / IEEE International Conference on Cluster Computing pp. 1 - 11 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.09.2025 |
| ISSN | 2168-9253 |
| DOI | 10.1109/CLUSTER59342.2025.11186491 |
| Summary: | Modern large language model (LLM) serving systems address distributed deployment challenges through two key techniques: distributed model partitioning for parallel computation across accelerators and quantization for reducing parameter size. While existing systems assume homogeneous GPU environments, we reveal significant untapped potential in heterogeneous systems with mixed-capacity accelerators, where two critical limitations persist: (1) uniform partitioning and quantization strategies fail to adapt to hardware heterogeneity, exacerbating resource imbalance, and (2) decoupled optimization of partitioning and quantization overlooks critical performance synergies between these techniques. We present SplitQuant, a phase-aware distributed serving system that co-optimizes mixed-precision quantization, phase-aware model partitioning, and micro-batch sizing for heterogeneous environments. Our approach combines analytical modeling of quality-runtime tradeoffs with a lightweight planning algorithm to maximize throughput while preserving user-specified model quality targets. Evaluations across 10 production clusters show SplitQuant achieves up to 2.34× (1.61× mean) higher throughput than state-of-the-art approaches without violating accuracy targets. Our results underscore the value of co-designing quantization and model partitioning strategies for heterogeneous environments. |
|---|---|