Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 1-13
Main Authors: Wu, Shuaipeng; Lin, Yanying; Peng, Shijie; Chen, Wenyan; Ma, Chong; Shen, Min; Chen, Le; Xu, Chengzhong; Ye, Kejiang
Format: Conference Proceeding
Language: English
Published: IEEE, 02.09.2025
ISSN: 2168-9253
DOI: 10.1109/CLUSTER59342.2025.11186463

More Information
Summary: In this paper, we present ROCK, a novel system for efficiently serving thousands of LoRA adapters for multimodal models in cloud environments. Through extensive analysis of production workloads, we identify key challenges in current cloud-based image generation services: extreme request burstiness (up to 90× normal rates), heterogeneous task characteristics, and inefficient adapter management that wastes 40% of GPU memory and increases delays by 3× during peak times. ROCK addresses these challenges through a three-layer architecture that decouples hardware, adapters, and requests. Our system features dynamic heterogeneous queues that match tasks to appropriate resources based on multidimensional feature vectors, and a multilevel orchestration framework that intelligently manages adapter placement across heterogeneous storage. Experiments on a 64-GPU testbed demonstrate that ROCK reduces average response latency by 16-26% and achieves an 84.1% cache hit rate for LoRA adapters, outperforming traditional approaches while reducing adapter update frequency by up to 77%.
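
To make the idea of adapter placement across storage tiers and the notion of a cache hit rate concrete, the following is a minimal sketch of a two-tier LRU cache. It is an illustration only, not ROCK's algorithm: the summary does not describe the placement policy, so the class name `TieredAdapterCache`, the plain LRU eviction rule, the two tiers, and the slot counts are all assumptions made for this example.

```python
from collections import OrderedDict


class TieredAdapterCache:
    """Toy two-tier LRU cache (hypothetical; not ROCK's actual design).

    The fast tier stands in for GPU memory and the slow tier for host
    memory. Both slot counts are assumed to be at least 1.
    """

    def __init__(self, fast_slots: int, slow_slots: int):
        self.fast = OrderedDict()  # adapter_id -> None, most recent last
        self.slow = OrderedDict()
        self.fast_slots = fast_slots
        self.slow_slots = slow_slots
        self.hits = 0       # requests served directly from the fast tier
        self.requests = 0

    def access(self, adapter_id: str) -> str:
        """Record one request and return where the adapter was found."""
        self.requests += 1
        if adapter_id in self.fast:
            self.fast.move_to_end(adapter_id)  # refresh recency
            self.hits += 1
            return "fast"
        source = "slow" if adapter_id in self.slow else "remote"
        self.slow.pop(adapter_id, None)
        self._promote(adapter_id)
        return source

    def _promote(self, adapter_id: str) -> None:
        """Load into the fast tier, demoting its LRU entry if it is full."""
        if len(self.fast) >= self.fast_slots:
            victim, _ = self.fast.popitem(last=False)
            self.slow[victim] = None
            if len(self.slow) > self.slow_slots:
                self.slow.popitem(last=False)  # drop from the cache entirely
        self.fast[adapter_id] = None

    def hit_rate(self) -> float:
        """Fraction of requests that hit the fast tier."""
        return self.hits / self.requests if self.requests else 0.0


cache = TieredAdapterCache(fast_slots=2, slow_slots=2)
trace = ["a", "b", "a", "c", "b", "a"]  # made-up request trace
sources = [cache.access(adapter_id) for adapter_id in trace]
print(sources, cache.hit_rate())
```

On this made-up trace only the third request finds its adapter already in the fast tier, so the fast-tier hit rate is 1/6. A production system would presumably also weigh adapter size, load cost, and request features, none of which this sketch models.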