ScSF: A Scheduling Simulation Framework
    
| Published in | Job Scheduling Strategies for Parallel Processing, Vol. 10773, pp. 152-173 |
|---|---|
| Main Authors | , , , | 
| Format | Book Chapter, Conference Proceeding |
| Language | English | 
| Published | Switzerland: Springer International Publishing AG, 01.01.2018 |
| Series | Lecture Notes in Computer Science | 
| Subjects | |
| ISBN | 9783319773971, 3319773976, 9783319773988, 3319773984 |
| ISSN | 0302-9743, 1611-3349 |
| DOI | 10.1007/978-3-319-77398-8_9 | 
| Summary | High-throughput and data-intensive applications, often composed as workflows, are increasingly present in the workloads of current HPC systems. At the same time, trends for future HPC systems point towards more heterogeneous systems with deeper I/O and memory hierarchies. However, current HPC schedulers are designed to support classical, large, tightly coupled parallel jobs on homogeneous systems. There is therefore an urgent need to investigate new scheduling algorithms that can manage future workloads on HPC systems, yet appropriate models and frameworks for developing, testing, and validating new scheduling ideas are lacking. In this paper, we present an open-source scheduler simulation framework (ScSF) that covers all the steps of scheduling research through simulation. ScSF provides capabilities for workload modeling, workload generation, system simulation, comparative workload analysis, and experiment orchestration. The simulator is designed to run over a distributed computing infrastructure, facilitating large-scale tests. We demonstrate ScSF through a case study that develops new techniques to manage scientific workflows in a batch scheduler. The evaluation consisted of 1728 experiments, equivalent to 33 years of simulated time, which were run over two months in a deployment of ScSF on a distributed infrastructure of 17 compute nodes. The experimental results were then analyzed using the ScSF framework to demonstrate that our technique minimizes workflow turnaround time without over-allocating resources. Finally, we discuss lessons learned from our experiences to inform future large-scale simulation studies using ScSF and other similar frameworks. |
| Bibliography | Source code available to download at: http://frieda.lbl.gov/download. G. P. Rodrigo: work performed while working at the Lawrence Berkeley National Lab. |