An application-level failure detection algorithm based on a robust and efficient torus-tree for HPC

Bibliographic Details
Published in	2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom) pp. 484 - 492
Main Authors	Ye, Yingjun; Zhang, Yongdong; Ye, Weicai
Format	Conference Proceeding
Language	English
Published	IEEE 01.09.2021
Subjects	Computer crashes; Failure detection; Fault tolerance; Fault tolerant systems; Feature extraction; HPC; Maintenance engineering; Robustness; Topology; Torus-Tree
DOI	10.1109/ISPA-BDCloud-SocialCom-SustainCom52081.2021.00073

Summary:	Applications must be fault-tolerant to scale to Exascale, and failure detection is a prerequisite for many fault-tolerance techniques. Implementing failure detection on large-scale High Performance Computing (HPC) systems is often difficult and expensive, and applications running on very large systems are likely to fall into long, unproductive runs because of faults that go undetected. We therefore designed an application-level failure detection algorithm that lets the application detect abnormally slow or crashed processes in time. The processes of an HPC application communicate over efficient topological overlays, but faulty processes can destroy the original communication structure. In this paper, we design a torus-tree topology for the failure detection algorithm. If processes fail, the surviving processes reconnect to maintain the connectivity of the communication graph and propagate fault information over the repaired communication structure. The torus-tree is designed to be easy to repair, robust, and communication-efficient, and the detection scheme combines local timing with global counting to overcome the difficulty of determining process faults under the HPC system model. We experimentally verify the effectiveness of the detection algorithm.
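
The summary mentions two ideas that a small example may make more concrete: suspecting a process by local timing, and reconnecting survivors so the communication graph stays connected. The sketch below is only an illustration under stated assumptions. It uses a plain ring overlay rather than the paper's torus-tree, whose structure this record does not describe, and the names `suspects`, `next_live_successor`, and `repaired_ring` are hypothetical, not taken from the paper.

```python
from typing import Dict, Optional, Set


def suspects(last_heard: Dict[int, float], now: float, timeout: float) -> Set[int]:
    """Local timing: suspect every neighbour that has been silent longer than `timeout`."""
    return {rank for rank, t in last_heard.items() if now - t > timeout}


def next_live_successor(rank: int, size: int, failed: Set[int]) -> Optional[int]:
    """Walk the ring from `rank` and return the first surviving rank, if any."""
    for step in range(1, size):
        candidate = (rank + step) % size
        if candidate not in failed:
            return candidate
    return None  # no other process survived


def repaired_ring(size: int, failed: Set[int]) -> Dict[int, Optional[int]]:
    """Map each surviving rank to its successor after skipping failed ranks.

    Assumption: a simple ring stands in for the torus-tree overlay.
    """
    return {
        rank: next_live_successor(rank, size, failed)
        for rank in range(size)
        if rank not in failed
    }


if __name__ == "__main__":
    # Rank 1 has been silent for 5 s, rank 2 for 0.5 s; the timeout is 1 s.
    print(suspects({1: 0.0, 2: 4.5}, now=5.0, timeout=1.0))
    # Ranks 1 and 2 failed on a six-process ring; survivors skip over them.
    print(repaired_ring(6, {1, 2}))
```

In this simplified setting, each survivor repairs its own outgoing link without global coordination, which is one reason ring-like overlays are easy to repair. How the paper combines this kind of repair with global counting is not stated in the record.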