Diffusion Model-aided Resource Scheduling for Multiple GAI Training Jobs

Bibliographic Details
Published in Proceedings - International Conference on Computer Communications and Networks, pp. 1-6
Main Authors Yuan, Meng; Wu, Qiang; Wang, Xiangbin; Sun, Siyang
Format Conference Proceeding
Language English
Published IEEE 04.08.2025
ISSN 2637-9430
DOI 10.1109/ICCCN65249.2025.11134026

Abstract With the rise of AI-Generated Content (AIGC), efficiently scheduling multiple distributed Generative AI (GAI) training jobs in a computing cluster has become crucial for cost-effectiveness. However, the resource-intensive nature and frequent communication demands of distributed training exacerbate resource fragmentation and network contention, resulting in low utilization and high latency. To address this, we propose an intelligent, dynamic resource scheduling method. First, we propose a scheduling analytical model that describes heterogeneous computing resources, communication contention, and the parameter synchronization architecture, and we formulate scheduling as a multi-objective optimization problem. Next, we propose a Diffusion Model-based AI-generated Resource Scheduling (DARS) algorithm to capture the dynamic, high-dimensional environment and generate the optimal scheduling decisions. Finally, we replace the policy network of deep reinforcement learning (DRL) with the proposed DARS to address environmental uncertainty and improve efficiency. Simulation results demonstrate that the proposed algorithm outperforms the algorithms it is compared against.
Author_xml – sequence: 1
  givenname: Meng
  surname: Yuan
  fullname: Yuan, Meng
  email: yuan_m@nuaa.edu.cn
  organization: Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China
– sequence: 2
  givenname: Qiang
  surname: Wu
  fullname: Wu, Qiang
  email: wu.qiang@nuaa.edu.cn
  organization: Zhejiang University, College of Computer Science and Technology, Hangzhou, China
– sequence: 3
  givenname: Xiangbin
  surname: Wang
  fullname: Wang, Xiangbin
  email: Leo_Wang_XB@nuaa.edu.cn
  organization: Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China
– sequence: 4
  givenname: Siyang
  surname: Sun
  fullname: Sun, Siyang
  email: siyang.sun@nuaa.edu.cn
  organization: Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China
Discipline Engineering
EISBN 9798331508982
EISSN 2637-9430
EndPage 6
ExternalDocumentID 11134026
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  funderid: 10.13039/501100001809
– fundername: National Key Research and Development Program of China
  funderid: 10.13039/501100012166
IsPeerReviewed false
IsScholarly true
Language English
PageCount 6
PublicationDate 2025-08-04
PublicationTitle Proceedings - International Conference on Computer Communications and Networks
PublicationTitleAbbrev ICCCN
PublicationYear 2025
Publisher IEEE
StartPage 1
SubjectTerms AIGC
Clustering algorithms
Computational modeling
Deep reinforcement learning
diffusion model
Dynamic scheduling
Heuristic algorithms
multiple GAI distributed training jobs
Optimal scheduling
Processor scheduling
resource scheduling
Synchronization
Training
Uncertainty
URI https://ieeexplore.ieee.org/document/11134026