Are We There Yet? Predicting the Queue Wait Times for HPC Jobs

Bibliographic Details
Published in	Proceedings / IEEE International Conference on Cluster Computing, pp. 1-12
Main Authors	Whitton, Christin; Jones, William; Walker, Craig; Job, Vanessa; Senator, Steven; DeBardeleben, Nathan
Format	Conference Proceeding
Language	English
Published	IEEE 02.09.2025
Subjects	Cluster computing; Computational modeling; High performance computing; HPC; Machine learning; Maximum likelihood estimation; Measurement; Out of order parallel job scheduling; Processor scheduling; Runtime; Schedules
ISSN	2168-9253
DOI	10.1109/CLUSTER59342.2025.11186489

Summary:	Large high-performance computing systems are commonly shared among users that submit their workflows to a resource manager and scheduling framework such as SLURM. Most commonly available job schedulers provide built-in algorithms for performing job backfill and placement, where candidate jobs can be run out of order on currently free resources, provided that they do not negatively impact other jobs already waiting in the queue. Backfilling relies on two key requirements: 1) the user's own estimate of the runtime of their job and 2) the ability for the scheduler to create and maintain a future schedule of all jobs in the queue at any one moment. Unfortunately, user-provided estimates are often erroneous, a well-known problem in parallel job scheduling. These estimates cause the scheduler to plan jobs based on inaccurate data, which in turn causes the scheduler-provided estimates of user wait time to be quite inaccurate. As such, in this work, we leverage several machine learning (ML) techniques to provide a more accurate estimate of user waiting time and contrast them across a variety of metrics including wait time and bounded per-processor slowdown using simulated data based on real job workload traces. The presented machine learning models improve overall wait time estimation by a factor of 4.1× over traditional scheduler-provided wait times.
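
The dependence of backfilling on user runtime estimates, as described in the summary, can be illustrated with a minimal sketch. This is an EASY-style backfill test written for illustration only; it is not SLURM's implementation and not the method of the paper. The names `Job`, `can_backfill`, `shadow_time`, and `extra_nodes` are hypothetical, and the node counts and times are made-up example values.

```python
from dataclasses import dataclass


@dataclass
class Job:
    nodes: int        # nodes requested (hypothetical field)
    estimate: float   # user-provided runtime estimate, in seconds


def can_backfill(job, now, free_nodes, shadow_time, extra_nodes):
    """EASY-style check: may `job` start now without delaying the queue head?

    shadow_time: earliest time the head-of-queue job is planned to start.
    extra_nodes: nodes that will still be idle once the head job starts.
    """
    if job.nodes > free_nodes:
        return False  # does not fit on the currently idle nodes
    # Either the job is expected to finish before the head job starts...
    finishes_in_time = now + job.estimate <= shadow_time
    # ...or it only uses nodes the head job will not need.
    fits_beside_head = job.nodes <= extra_nodes
    return finishes_in_time or fits_beside_head


# The same 4-node job submitted with an accurate and with a padded estimate.
accurate = Job(nodes=4, estimate=20 * 60)    # 20 minutes
padded = Job(nodes=4, estimate=2 * 3600)     # 2 hours

# Example state: 8 nodes idle now; the head job is planned to start in
# 1 hour and will leave 2 nodes spare.
state = dict(now=0.0, free_nodes=8, shadow_time=3600.0, extra_nodes=2)
accurate_ok = can_backfill(accurate, **state)
padded_ok = can_backfill(padded, **state)
print(accurate_ok, padded_ok)
```

In this sketch the scheduler sees only `estimate`, never the true runtime, so the padded submission is refused a backfill slot that the job would in fact have fit into. Errors of this kind propagate into the planned schedule and therefore into any wait time derived from it.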