Are We There Yet? Predicting the Queue Wait Times for HPC Jobs
| Published in | Proceedings / IEEE International Conference on Cluster Computing pp. 1 - 12 |
|---|---|
| Main Authors | , , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 02.09.2025 |
| Subjects | |
| ISSN | 2168-9253 |
| DOI | 10.1109/CLUSTER59342.2025.11186489 |
| Summary | Large high-performance computing systems are commonly shared among users who submit their workflows to a resource manager and scheduling framework such as SLURM. Most commonly available job schedulers provide built-in algorithms for performing job backfill and placement, where candidate jobs can be run out of order on currently free resources, provided that they do not negatively impact other jobs already waiting in the queue. Backfilling relies on two key requirements: 1) the user's own estimate of the runtime of their job and 2) the ability for the scheduler to create and maintain a future schedule of all jobs in the queue at any one moment. Unfortunately, user-provided estimates are often erroneous, a well-known problem in parallel job scheduling. These estimates cause the scheduler to plan jobs based on inaccurate data, which in turn causes the scheduler-provided estimates of user wait time to be quite inaccurate. As such, in this work, we leverage several machine learning (ML) techniques to provide a more accurate estimate of user waiting time and contrast them across a variety of metrics including wait time and bounded per-processor slowdown using simulated data based on real job workload traces. The presented machine learning models improve overall wait time estimation by a factor of 4.1× over traditional scheduler-provided wait times. |
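
The summary names bounded per-processor slowdown as one of the evaluation metrics but does not define it. The sketch below follows the formulation commonly used in the parallel job scheduling literature, which may differ from the one used in the paper. The function names, the argument names, and the 10-second threshold `tau_s` are all assumptions made for illustration.

```python
def bounded_slowdown(wait_s: float, run_s: float, tau_s: float = 10.0) -> float:
    """Bounded slowdown: (wait + run) / max(run, tau), floored at 1.

    tau_s (assumed default: 10 s) keeps very short jobs from
    dominating the metric.
    """
    return max((wait_s + run_s) / max(run_s, tau_s), 1.0)


def per_processor_bounded_slowdown(
    wait_s: float, run_s: float, procs: int, tau_s: float = 10.0
) -> float:
    """Per-processor variant: the ratio is additionally divided by the
    number of processors the job uses, then floored at 1."""
    return max((wait_s + run_s) / (procs * max(run_s, tau_s)), 1.0)


# A job that waits 90 s and runs 10 s took 10x its own runtime overall.
print(bounded_slowdown(90.0, 10.0))                    # 10.0
# A 1 s job is measured against tau, so the floor of 1 applies.
print(bounded_slowdown(5.0, 1.0))                      # 1.0
# The same 90 s / 10 s job on 4 processors.
print(per_processor_bounded_slowdown(90.0, 10.0, 4))   # 2.5
```

Under this definition, an error in the predicted wait time feeds directly into the numerator, so a wait time estimator can be scored by comparing the slowdown computed from predicted waits against the slowdown computed from observed waits.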