Rock: Serving Multimodal Models in Cloud with Heterogeneous-Aware Resource Orchestration for Thousands of LoRA Adapters

Bibliographic Details
Published in	Proceedings / IEEE International Conference on Cluster Computing, pp. 1-13
Main Authors	Wu, Shuaipeng; Lin, Yanying; Peng, Shijie; Chen, Wenyan; Ma, Chong; Shen, Min; Chen, Le; Xu, Chengzhong; Ye, Kejiang
Format	Conference Proceeding
Language	English
Published	IEEE 02.09.2025
Subjects	Adaptation models; Dynamic scheduling; Hardware; Image synthesis; Loading; Memory management; Production; Rocks; Throughput; Vectors
ISSN	2168-9253
DOI	10.1109/CLUSTER59342.2025.11186463

Summary:	In this paper, we present ROCK, a novel system for efficiently serving thousands of LoRA adapters for multimodal models in cloud environments. Through extensive analysis of production workloads, we identify key challenges in current cloud-based image generation services: extreme request burstiness (up to 90× normal rates), heterogeneous task characteristics, and inefficient adapter management that wastes 40% of GPU memory and increases delays by 3× during peak times. ROCK addresses these challenges through a three-layer architecture that decouples hardware, adapters, and requests. Our system features dynamic heterogeneous queues that match tasks to appropriate resources based on multidimensional feature vectors, and a multilevel orchestration framework that intelligently manages adapter placement across heterogeneous storage. Experiments on a 64-GPU testbed demonstrate that ROCK reduces average response latency by 16-26% and achieves an 84.1% cache hit rate for LoRA adapters, outperforming traditional approaches while reducing adapter update frequency by up to 77%.