Diffusion Model-aided Resource Scheduling for Multiple GAI Training Jobs

Bibliographic Details
Published in Proceedings - International Conference on Computer Communications and Networks, pp. 1-6
Main Authors Yuan, Meng; Wu, Qiang; Wang, Xiangbin; Sun, Siyang
Format Conference Proceeding
Language English
Published IEEE 04.08.2025
ISSN 2637-9430
DOI 10.1109/ICCCN65249.2025.11134026

Summary: With the rise of AI-Generated Content (AIGC), efficiently scheduling multiple Generative AI (GAI) distributed training jobs in a computing cluster has become crucial for cost-effectiveness. However, the resource-intensive nature and frequent communication demands of distributed training exacerbate resource fragmentation and network contention, resulting in low utilization and high latency. To address this, we propose an intelligent, dynamic resource scheduling method. First, we build an analytical scheduling model that describes heterogeneous computing resources, communication contention, and the parameter synchronization architecture, and we formulate scheduling as a multi-objective optimization problem. Next, we propose a Diffusion Model-based AI-generated Resource Scheduling (DARS) algorithm that captures the dynamic, high-dimensional environment and generates optimal scheduling decisions. Finally, the policy network of deep reinforcement learning (DRL) is replaced with the proposed DARS to address environmental uncertainty and improve efficiency. Simulation results show that the proposed algorithm outperforms the comparison algorithms.
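
The summary does not give the internals of DARS, so the following is only an illustrative sketch of the general idea of using a diffusion model as a policy: start from Gaussian noise and run a standard DDPM-style reverse denoising loop, conditioned on the cluster state, to produce a probability distribution over candidate placements. Everything named here is an assumption made for illustration: `toy_denoiser` stands in for a trained noise-prediction network, `W` holds random untrained weights, and the state vector, the number of steps `T`, and the noise schedule are hypothetical choices, not values from the paper.

```python
import numpy as np

def make_schedule(T=10, beta_start=1e-4, beta_end=0.2):
    """Linear noise schedule (hypothetical values, not from the paper)."""
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_denoiser(x_t, t_frac, state, W):
    """Stand-in for a trained noise predictor eps(x_t, t, state).

    A real policy would use a learned network; this is one tanh layer
    with caller-supplied weights W, purely for illustration.
    """
    features = np.concatenate([x_t, state, [t_frac]])
    return np.tanh(W @ features)

def sample_action(state, n_actions, W, rng, T=10):
    """Generate action probabilities by reverse diffusion from pure noise."""
    betas, alphas, alpha_bars = make_schedule(T)
    x = rng.standard_normal(n_actions)           # x_T ~ N(0, I)
    for t in range(T - 1, -1, -1):
        eps = toy_denoiser(x, t / T, state, W)   # predicted noise at step t
        # DDPM posterior mean for x_{t-1} given x_t and the predicted noise
        mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        # No noise is added on the final step
        noise = rng.standard_normal(n_actions) if t > 0 else 0.0
        x = mean + np.sqrt(betas[t]) * noise
    # Softmax turns the denoised vector into a distribution over placements
    z = x - x.max()
    probs = np.exp(z) / np.exp(z).sum()
    return probs

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Hypothetical state: free-GPU fraction on two nodes and one link load
    state = np.array([0.5, 0.2, 0.9])
    n_actions = 4                                 # candidate placements
    W = rng.standard_normal((n_actions, n_actions + state.size + 1)) * 0.1
    print(sample_action(state, n_actions, W, rng))
```

In a DRL setting of the kind the summary describes, the sampled distribution would take the place of the policy network's output, and the denoiser's weights would be trained from the reward signal; that training step is omitted here.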