Checkpointing Workflows for Fail-Stop Errors

Bibliographic Details
Published in	IEEE Transactions on Computers, Vol. 67, no. 8, pp. 1105-1120
Main Authors	Han, Li; Canon, Louis-Claude; Casanova, Henri; Robert, Yves; Vivien, Frederic
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.08.2018
Subjects	Algorithms; checkpoint; Checkpointing; Computer Science; Data Structures and Algorithms; Distributed, Parallel, and Cluster Computing; Dynamic programming; fail-stop error; Failure; Graph theory; Graphs; Heuristic algorithms; Microprocessors; Probabilistic logic; Processors; Program processors; resilience; Schedules; Task analysis; Task scheduling; Workflow; Workflow software
ISSN	0018-9340, 1557-9956, 2326-3814
DOI	10.1109/TC.2018.2801300

Summary:	We consider the problem of orchestrating the execution of workflow applications structured as Directed Acyclic Graphs (DAGs) on parallel computing platforms that are subject to fail-stop failures. The objective is to minimize expected overall execution time, or makespan. A solution to this problem consists of a schedule of the workflow tasks on the available processors and of a decision of which application data to checkpoint to stable storage, so as to mitigate the impact of processor failures. To address this challenge, we consider a restricted class of graphs, Minimal Series-Parallel Graphs (M-SPGs), which is relevant to many real-world workflow applications. For this class of graphs, we propose a recursive list-scheduling algorithm that exploits the M-SPG structure to assign sub-graphs to individual processors, and uses dynamic programming to decide how to checkpoint these sub-graphs. We assess the performance of our algorithm for production workflow configurations, comparing it to an approach in which all application data is checkpointed and an approach in which no application data is checkpointed. Results demonstrate that our algorithm outperforms both the former approach, because of lower checkpointing overhead, and the latter approach, because of better resilience to failures.
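
To give a flavor of the kind of dynamic program the summary mentions, the sketch below places checkpoints along a simple linear chain of tasks executed on one processor. It is not the article's algorithm for M-SPGs; it is the classical chain formulation, written under stated assumptions: failures follow an exponential distribution with a known rate, the final task's output is always checkpointed, and the expected time of a segment uses the standard closed form e^(rate * recovery) * (1/rate + downtime) * (e^(rate * (work + ckpt)) - 1). All function and parameter names (`expected_time`, `place_checkpoints`, `weights`, `ckpt_costs`, `rec_costs`) are hypothetical and chosen for illustration only.

```python
import math


def expected_time(work, ckpt, recovery, rate, downtime=0.0):
    """Expected time to run `work` and then write a checkpoint of cost `ckpt`.

    Assumes exponentially distributed fail-stop failures with the given rate;
    after each failure the processor is down for `downtime` and then reloads
    the previous checkpoint at cost `recovery`.
    """
    return (math.exp(rate * recovery)
            * (1.0 / rate + downtime)
            * math.expm1(rate * (work + ckpt)))


def place_checkpoints(weights, ckpt_costs, rec_costs, rate, downtime=0.0):
    """Choose after which tasks of a chain to checkpoint.

    weights[k], ckpt_costs[k], rec_costs[k] describe task k+1 (tasks are
    numbered from 1). Returns (expected makespan, list of 1-based task
    indices after which a checkpoint is taken). The last task is always
    checkpointed in this sketch.
    """
    n = len(weights)
    prefix = [0.0]
    for w in weights:
        prefix.append(prefix[-1] + w)

    # best[j]: minimal expected time to finish tasks 1..j with a checkpoint after j.
    best = [0.0] + [math.inf] * n
    prev = [0] * (n + 1)
    for j in range(1, n + 1):
        for i in range(j):  # i = index of the previous checkpoint (0 = start)
            recovery = rec_costs[i - 1] if i > 0 else 0.0
            segment = expected_time(prefix[j] - prefix[i], ckpt_costs[j - 1],
                                    recovery, rate, downtime)
            if best[i] + segment < best[j]:
                best[j], prev[j] = best[i] + segment, i

    # Walk back through the recorded choices to recover the checkpoint positions.
    positions = []
    j = n
    while j > 0:
        positions.append(j)
        j = prev[j]
    positions.reverse()
    return best[n], positions
```

With a very small failure rate the sketch keeps only the mandatory final checkpoint, while a high failure rate pushes it toward checkpointing after every task. This mirrors the trade-off the summary describes between checkpointing overhead and resilience to failures.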