SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
Modern large language model (LLM) serving systems address distributed deployment challenges through two key techniques: distributed model partitioning for parallel computation across accelerators and quantization for reducing parameter size. While existing systems assume homogeneous GPU environments, we reveal significant untapped potential in heterogeneous systems with mixed-capacity accelerators, where two critical limitations persist: (1) uniform partitioning and quantization strategies fail to adapt to hardware heterogeneity, exacerbating resource imbalance, and (2) decoupled optimization of partitioning and quantization overlooks critical performance synergies between these techniques. We present SplitQuant, a phase-aware distributed serving system that co-optimizes mixed-precision quantization, phase-aware model partitioning, and micro-batch sizing for heterogeneous environments. Our approach combines analytical modeling of quality-runtime tradeoffs with a lightweight planning algorithm to maximize throughput while preserving user-specified model quality targets. Evaluations across 10 production clusters show SplitQuant achieves up to 2.34× (1.61× mean) higher throughput than state-of-the-art approaches without violating accuracy targets. Our results underscore the value of co-designing quantization and model partitioning strategies for heterogeneous environments.
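The record gives only this high-level description of the planner. As a purely illustrative aid, the following minimal Python sketch shows what a quality-constrained co-optimization search over layer partition, per-stage precision, and micro-batch size could look like. The cost and quality models, the brute-force enumeration, and all names here are assumptions for illustration only, not SplitQuant's actual algorithm (the paper describes a lightweight planner whose analytical models are not reproduced in this record).

```python
import itertools
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    tflops: float  # rough compute capability; stands in for profiled speed

def contiguous_splits(n_layers: int, n_stages: int):
    """Yield every way to split n_layers into n_stages contiguous,
    non-empty pipeline stages, e.g. (10, 22) for 32 layers on 2 GPUs."""
    if n_stages == 1:
        yield (n_layers,)
        return
    for first in range(1, n_layers - n_stages + 2):
        for rest in contiguous_splits(n_layers - first, n_stages - 1):
            yield (first,) + rest

def stage_latency(layers: int, bits: int, gpu: GPU, micro_batch: int) -> float:
    """Toy runtime model: a fixed per-micro-batch overhead (launch,
    communication) plus work that grows with layers and batch, shrinks
    with lower precision, and is divided by device compute capability."""
    return 0.1 + layers * micro_batch * (bits / 16.0) / gpu.tflops

def quality_loss(bits_per_stage) -> float:
    """Toy quality model: each bit of precision dropped below 16 costs
    a fixed accuracy penalty. The paper's analytical model is not given."""
    return sum(0.004 * (16 - b) for b in bits_per_stage)

def plan(total_layers: int, gpus: list, quality_budget: float,
         bit_choices=(4, 8, 16), micro_batches=(1, 2, 4, 8)):
    """Exhaustive stand-in for a lightweight planner: jointly choose the
    layer split, per-stage bit width, and micro-batch size that maximize
    pipeline throughput without exceeding the quality budget."""
    best = None
    for split in contiguous_splits(total_layers, len(gpus)):
        for bits in itertools.product(bit_choices, repeat=len(gpus)):
            if quality_loss(bits) > quality_budget:
                continue  # violates the user-specified quality target
            for mb in micro_batches:
                # Pipeline throughput is bounded by the slowest stage.
                bottleneck = max(stage_latency(l, b, g, mb)
                                 for l, b, g in zip(split, bits, gpus))
                throughput = mb / bottleneck
                if best is None or throughput > best[0]:
                    best = (throughput, split, bits, mb)
    return best

# Example: one fast and one slow GPU serving a 32-layer model.
if __name__ == "__main__":
    cluster = [GPU("fast", 312.0), GPU("slow", 65.0)]
    print(plan(total_layers=32, gpus=cluster, quality_budget=0.05))
```

A real planner would replace this exhaustive enumeration with the paper's lightweight search, and would calibrate the latency and quality models against profiled hardware and measured accuracy rather than the toy formulas above.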
| Published in | Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–11 |
|---|---|
| Main Authors | Juntao Zhao (The University of Hong Kong; juntaozh@connect.hku.hk), Borui Wan (The University of Hong Kong; wanborui@connect.hku.hk), Yanghua Peng (ByteDance Inc., USA; pengyanghua.yanghua@bytedance.com), Haibin Lin (ByteDance Inc., USA; haibin.lin@bytedance.com), Chuan Wu (The University of Hong Kong; cwu@cs.hku.hk) |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.09.2025 |
| Subjects | Adaptation models; Analytical models; Computational modeling; Heterogeneous Computing; Inference; Large Language Models (LLMs); Model Partitioning; Pipelines; Planning; Predictive models; Production; Quantization; Throughput |
| Online Access | https://ieeexplore.ieee.org/document/11186491 |
| EISSN | 2168-9253 |
| EISBN | 9798331530198 |
| DOI | 10.1109/CLUSTER59342.2025.11186491 |