Diffusion Model-aided Resource Scheduling for Multiple GAI Training Jobs
| Published in | Proceedings - International Conference on Computer Communications and Networks pp. 1 - 6 |
|---|---|
| Main Authors | Yuan, Meng; Wu, Qiang; Wang, Xiangbin; Sun, Siyang |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 04.08.2025 |
| ISSN | 2637-9430 |
| DOI | 10.1109/ICCCN65249.2025.11134026 |
| Abstract | With the prosperity of AI-Generated Content (AIGC), efficiently scheduling multiple Generative AI (GAI) distributed training jobs in a computing cluster has become crucial for pursuing higher cost-effectiveness. However, the resource-intensive nature and frequent communication demands of distributed training exacerbate resource fragmentation and network contention, resulting in low utilization and high latency. To this end, we propose an intelligent and dynamic resource scheduling method. First, we propose a scheduling analytical model that describes heterogeneous computing resources, communication contention, and the parameter synchronization architecture, and we formulate the scheduling task as a multi-objective optimization problem. Next, we propose a Diffusion Model-based AI-generated Resource Scheduling (DARS) algorithm to capture the dynamic, high-dimensional environment and generate the optimal scheduling decisions. Finally, the policy network of deep reinforcement learning (DRL) is replaced with the proposed DARS to address environmental uncertainty and enhance efficiency. Simulation results demonstrate that the proposed algorithm outperforms related algorithms. |
|---|---|
| Author | Wang, Xiangbin; Sun, Siyang; Yuan, Meng; Wu, Qiang |
| Author_xml | 1. Yuan, Meng (yuan_m@nuaa.edu.cn), Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China; 2. Wu, Qiang (wu.qiang@nuaa.edu.cn), Zhejiang University, College of Computer Science and Technology, Hangzhou, China; 3. Wang, Xiangbin (Leo_Wang_XB@nuaa.edu.cn), Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China; 4. Sun, Siyang (siyang.sun@nuaa.edu.cn), Nanjing University of Aeronautics and Astronautics, College of Computer Science and Technology, Nanjing, China |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
| Discipline | Engineering |
| EISBN | 9798331508982 |
| EISSN | 2637-9430 |
| EndPage | 6 |
| ExternalDocumentID | 11134026 |
| Genre | orig-research |
| GrantInformation_xml | National Natural Science Foundation of China (funder ID: 10.13039/501100001809); National Key Research and Development Program of China (funder ID: 10.13039/501100012166) |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 6 |
| PublicationCentury | 2000 |
| PublicationDate | 2025-Aug.-4 |
| PublicationDateYYYYMMDD | 2025-08-04 |
| PublicationDate_xml | – month: 08 year: 2025 text: 2025-Aug.-4 day: 04 |
| PublicationDecade | 2020 |
| PublicationTitle | Proceedings - International Conference on Computer Communications and Networks |
| PublicationTitleAbbrev | ICCCN |
| PublicationYear | 2025 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| StartPage | 1 |
| SubjectTerms | AIGC; Clustering algorithms; Computational modeling; Deep reinforcement learning; diffusion model; Dynamic scheduling; Heuristic algorithms; multiple GAI distributed training jobs; Optimal scheduling; Processor scheduling; resource scheduling; Synchronization; Training; Uncertainty |
| Title | Diffusion Model-aided Resource Scheduling for Multiple GAI Training Jobs |
| URI | https://ieeexplore.ieee.org/document/11134026 |