ScSF: A Scheduling Simulation Framework

Bibliographic Details
Published in: Job Scheduling Strategies for Parallel Processing, Vol. 10773, pp. 152-173
Main Authors: Rodrigo, Gonzalo P.; Elmroth, Erik; Östberg, Per-Olov; Ramakrishnan, Lavanya
Format: Book Chapter, Conference Proceeding
Language: English
Published: Switzerland, Springer International Publishing AG, 01.01.2018
Springer International Publishing
Series: Lecture Notes in Computer Science
ISBN: 9783319773971, 3319773976, 9783319773988, 3319773984
ISSN: 0302-9743, 1611-3349
DOI: 10.1007/978-3-319-77398-8_9

Summary: High-throughput and data-intensive applications are increasingly present, often composed as workflows, in the workloads of current HPC systems. At the same time, trends for future HPC systems point towards more heterogeneous systems with deeper I/O and memory hierarchies. However, current HPC schedulers are designed to support classical large, tightly coupled parallel jobs over homogeneous systems. Therefore, there is an urgent need to investigate new scheduling algorithms that can manage the future workloads on HPC systems. However, there is a lack of appropriate models and frameworks to enable development, testing, and validation of new scheduling ideas. In this paper, we present an open-source scheduler simulation framework (ScSF) that covers all the steps of scheduling research through simulation. ScSF provides capabilities for workload modeling, workload generation, system simulation, comparative workload analysis, and experiment orchestration. The simulator is designed to be run over a distributed computing infrastructure, facilitating large-scale tests. We demonstrate ScSF through a case study to develop new techniques to manage scientific workflows in a batch scheduler. The evaluation consisted of 1728 experiments, equivalent to 33 years of simulated time, which were run in a deployment of ScSF over a distributed infrastructure of 17 compute nodes over two months. The experimental results were then analyzed using the ScSF framework to demonstrate that our technique minimizes workflow turnaround time without over-allocating resources. Finally, we discuss lessons learned from our experiences to inform future large-scale simulation studies using ScSF and other similar frameworks.
Bibliography: Source code available to download at: http://frieda.lbl.gov/download. G. P. Rodrigo: Work performed while working at the Lawrence Berkeley National Lab.