SplitQuant: Resource-Efficient LLM Offline Serving on Heterogeneous GPUs via Phase-Aware Model Partition and Adaptive Quantization
Modern large language model (LLM) serving systems address distributed deployment challenges through two key techniques: distributed model partitioning for parallel computation across accelerators and quantization for reducing parameter size. While existing systems assume homogeneous GPU environments, we reveal significant untapped potential in heterogeneous systems with mixed-capacity accelerators, where two critical limitations persist: (1) uniform partitioning and quantization strategies fail to adapt to hardware heterogeneity, exacerbating resource imbalance, and (2) decoupled optimization of partitioning and quantization overlooks critical performance synergies between these techniques. We present SplitQuant, a phase-aware distributed serving system that co-optimizes mixed-precision quantization, phase-aware model partitioning, and micro-batch sizing for heterogeneous environments. Our approach combines analytical modeling of quality-runtime tradeoffs with a lightweight planning algorithm to maximize throughput while preserving user-specified model quality targets. Evaluations across 10 production clusters show SplitQuant achieves up to 2.34× (1.61× mean) higher throughput than state-of-the-art approaches without violating accuracy targets. Our results underscore the value of co-designing quantization and model partitioning strategies for heterogeneous environments.
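The record gives only this high-level description of the planner. As a purely illustrative aid, the following minimal Python sketch shows what a quality-constrained co-optimization search over layer partition, per-stage precision, and micro-batch size could look like. The cost and quality models, the brute-force enumeration, and all names here are assumptions for illustration only, not SplitQuant's actual algorithm (the paper describes a lightweight planner whose analytical models are not reproduced in this record).

```python
import itertools
from dataclasses import dataclass

@dataclass
class GPU:
    name: str
    tflops: float  # rough compute capability; stands in for profiled speed

def contiguous_splits(n_layers: int, n_stages: int):
    """Yield every way to split n_layers into n_stages contiguous,
    non-empty pipeline stages, e.g. (10, 22) for 32 layers on 2 GPUs."""
    if n_stages == 1:
        yield (n_layers,)
        return
    for first in range(1, n_layers - n_stages + 2):
        for rest in contiguous_splits(n_layers - first, n_stages - 1):
            yield (first,) + rest

def stage_latency(layers: int, bits: int, gpu: GPU, micro_batch: int) -> float:
    """Toy runtime model: a fixed per-micro-batch overhead (launch,
    communication) plus work that grows with layers and batch, shrinks
    with lower precision, and is divided by device compute capability."""
    return 0.1 + layers * micro_batch * (bits / 16.0) / gpu.tflops

def quality_loss(bits_per_stage) -> float:
    """Toy quality model: each bit of precision dropped below 16 costs
    a fixed accuracy penalty. The paper's analytical model is not given."""
    return sum(0.004 * (16 - b) for b in bits_per_stage)

def plan(total_layers: int, gpus: list, quality_budget: float,
         bit_choices=(4, 8, 16), micro_batches=(1, 2, 4, 8)):
    """Exhaustive stand-in for a lightweight planner: jointly choose the
    layer split, per-stage bit width, and micro-batch size that maximize
    pipeline throughput without exceeding the quality budget."""
    best = None
    for split in contiguous_splits(total_layers, len(gpus)):
        for bits in itertools.product(bit_choices, repeat=len(gpus)):
            if quality_loss(bits) > quality_budget:
                continue  # violates the user-specified quality target
            for mb in micro_batches:
                # Pipeline throughput is bounded by the slowest stage.
                bottleneck = max(stage_latency(l, b, g, mb)
                                 for l, b, g in zip(split, bits, gpus))
                throughput = mb / bottleneck
                if best is None or throughput > best[0]:
                    best = (throughput, split, bits, mb)
    return best

# Example: one fast and one slow GPU serving a 32-layer model.
if __name__ == "__main__":
    cluster = [GPU("fast", 312.0), GPU("slow", 65.0)]
    print(plan(total_layers=32, gpus=cluster, quality_budget=0.05))
```

A real planner would replace this exhaustive enumeration with the paper's lightweight search, and would calibrate the latency and quality models against profiled hardware and measured accuracy rather than the toy formulas above.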
| Published in | Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER), pp. 1–11 |
|---|---|
| Main Authors | Juntao Zhao (The University of Hong Kong; juntaozh@connect.hku.hk), Borui Wan (The University of Hong Kong; wanborui@connect.hku.hk), Yanghua Peng (ByteDance Inc., USA; pengyanghua.yanghua@bytedance.com), Haibin Lin (ByteDance Inc., USA; haibin.lin@bytedance.com), Chuan Wu (The University of Hong Kong; cwu@cs.hku.hk) |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.09.2025 |
| Subjects | Adaptation models; Analytical models; Computational modeling; Heterogeneous Computing; Inference; Large Language Models (LLMs); Model Partitioning; Pipelines; Planning; Predictive models; Production; Quantization; Throughput |
| Online Access | https://ieeexplore.ieee.org/document/11186491 |
| EISSN | 2168-9253 |
| EISBN | 9798331530198 |
| DOI | 10.1109/CLUSTER59342.2025.11186491 |