A Proactive Fault Tolerance Approach to High Performance Computing (HPC) in the Cloud
| Published in | 2012 International Conference on Cloud and Green Computing pp. 268 - 273 |
|---|---|
| Main Authors | , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.11.2012 |
| Subjects | |
| ISBN | 1467330272 9781467330275 |
| DOI | 10.1109/CGC.2012.22 |
| Summary: | Cloud computing offers new computing paradigms, capacity, and flexibility to high performance computing (HPC) applications by provisioning a large number of Virtual Machines (VMs) for computation-intensive applications using the Hardware as a Service (HaaS) model. However, due to the large number of VMs and electronic components in HPC systems in the cloud, any fault during execution would result in re-running the application, which costs time, money, and energy. In this paper we present a proactive Fault Tolerance (FT) approach to HPC systems in the cloud to reduce the wall clock execution time in the presence of faults. We develop a generic FT algorithm for HPC systems in the cloud. Our algorithm does not rely on a spare node prior to prediction of a failure. We analyze the dollar cost of provisioning spare nodes to assess the value of our approach. Our experimental results, obtained from a real cloud execution environment, show that the wall clock execution time of computation-intensive applications in the cloud can be reduced by as much as 30%. The checkpointing frequency of computation-intensive applications can be reduced to 50% with our fault tolerance approach for HPC in the cloud, compared to current FT approaches. |
|---|---|