Compiler-controlled extraction of computation-communication overlap in MPI applications

Bibliographic Details
Published in: 2008 IEEE International Symposium on Parallel and Distributed Processing, pp. 1-8
Main Authors: Das, D., Gupta, M., Ravindran, R., Shivani, W., Sivakeshava, P., Uppal, R.
Format: Conference Proceeding
Language: English
Published: IEEE, 01.04.2008
ISBN: 1424416930, 9781424416936
ISSN: 1530-2075
DOI: 10.1109/IPDPS.2008.4536193

Summary: Exploiting computation-communication overlap is a well-known requirement for speeding up distributed applications. However, efforts so far have relied on programmer expertise rather than on any automatic tool. In our work we propose the use of an aggressive optimizing compiler (IBM's xl series) to automatically extract opportunities for computation-communication overlap. We depend on the aggressive inlining, dominator trees, and SSA-based use-def analyses provided by the compiler framework to exploit such overlap. Our target is MPI applications. In such applications, we try to automatically move mpi_waits and to split blocking mpi_send/recv calls to create more opportunities for overlap. Our objective is two-fold: first, our tool should relieve the programmer, as much as possible, of the burden of hunting for overlap manually; second, it should help identify quickly which parallel applications benefit from such overlap. Both are necessary because MPI applications are rapidly becoming large and complex, and manual overlap extraction is becoming cumbersome. Our early experience shows that exploiting overlap does not always lead to a performance improvement. This strengthens the case for an automatic tool: with one, we can quickly discard such applications (or certain configurations of them) without spending person-hours manually rewriting MPI applications to introduce non-blocking calls. Our initial experiments with the industry-standard NAS Parallel Benchmarks show that we can get small-to-moderate improvements by utilizing overlap even in such highly tuned benchmarks. This augurs well for real-world applications that do not exploit overlap optimally.
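
The transformation the summary describes (splitting a blocking send/receive into a non-blocking call plus a wait, then moving the wait as late as possible) can be illustrated with a toy model. The sketch below is not the authors' compiler pass: it handles only a straight-line list of statements, whereas the summary mentions dominator trees and SSA-based use-def analysis. The names `Stmt`, `split_and_sink`, and the `req0`-style request names are hypothetical, chosen only for illustration. It conservatively assumes that any statement naming an in-flight buffer, whether it reads or writes it, must come after the wait.

```python
from typing import List, NamedTuple, Tuple


class Stmt(NamedTuple):
    """Toy statement: an operation name plus the variables it touches.

    For MPI operations, vars[0] is assumed to be the message buffer.
    """
    op: str
    vars: Tuple[str, ...] = ()


# Blocking call -> non-blocking counterpart.
NONBLOCKING = {"MPI_Send": "MPI_Isend", "MPI_Recv": "MPI_Irecv"}


def split_and_sink(stmts: List[Stmt]) -> List[Stmt]:
    """Split blocking calls and delay each wait until the buffer is next touched."""
    out: List[Stmt] = []
    pending: List[Tuple[str, str]] = []  # (request, buffer) pairs still in flight
    n_requests = 0
    for s in stmts:
        # A statement that touches an in-flight buffer forces its wait first.
        for req, buf in list(pending):
            if buf in s.vars:
                out.append(Stmt("MPI_Wait", (req,)))
                pending.remove((req, buf))
        if s.op in NONBLOCKING:
            buf = s.vars[0]
            req = f"req{n_requests}"  # hypothetical request-handle name
            n_requests += 1
            out.append(Stmt(NONBLOCKING[s.op], (buf, req)))
            pending.append((req, buf))
        else:
            out.append(s)
    # Requests never forced earlier are completed at the end of the region.
    for req, _ in pending:
        out.append(Stmt("MPI_Wait", (req,)))
    return out


if __name__ == "__main__":
    program = [
        Stmt("MPI_Recv", ("halo",)),
        Stmt("compute", ("interior",)),         # independent of the message
        Stmt("compute", ("halo", "interior")),  # needs the received data
    ]
    for stmt in split_and_sink(program):
        print(stmt.op, stmt.vars)
```

In this model, the first `compute` ends up between the `MPI_Irecv` and its `MPI_Wait`, which is the overlap window. When the very next statement needs the buffer, the wait lands immediately after the non-blocking call and no overlap is gained, which is consistent with the summary's remark that exploiting overlap does not always pay off.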