Static Approximation of MPI Communication Graphs for Optimized Process Placement


Bibliographic Details
Published in: Languages and Compilers for Parallel Computing, pp. 268–283
Main Authors: McPherson, Andrew J.; Nagarajan, Vijay; Cintra, Marcelo
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing, 2015
Series: Lecture Notes in Computer Science
ISBN: 331917472X; 9783319174723
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-319-17473-0_18

Summary: Message Passing Interface (MPI) is the de facto standard for programming large-scale parallel programs. Static understanding of MPI programs informs optimizations such as process placement and communication/computation overlap, as well as debugging. In this paper, we present a fully context- and flow-sensitive, interprocedural, best-effort analysis framework to statically analyze MPI programs. We instantiate this to determine an approximation of the point-to-point communication graph of an MPI program. Our analysis is the first pragmatic approach to realizing the full point-to-point communication graph without profiling; indeed, our experiments show that we are able to resolve and understand 100% of the relevant MPI call sites across the NAS Parallel Benchmarks. In all but one case, this only requires specifying the number of processes. To demonstrate an application, we use the analysis to determine process placement on a Chip MultiProcessor (CMP) based cluster. The use of a CMP-based cluster creates a two-tier system, where inter-node communication can be subject to greater latencies than intra-node communication. Intelligent process placement can therefore have a significant impact on the execution time. Using the 64-process versions of the benchmarks, and our analysis, we see an average of 28% (7%) improvement in communication localization over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement.
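The placement application described in the summary can be illustrated with a small sketch: given a weighted point-to-point communication graph (edge weights approximating message volume between MPI ranks), group heavily-communicating processes onto the same node to maximize the fraction of intra-node traffic. This is a minimal, hypothetical greedy heuristic for illustration only, not the paper's actual placement algorithm; `cores_per_node` and the graph representation are assumptions.

```python
# Hypothetical sketch of communication-graph-driven process placement.
# NOT the paper's algorithm: a simple greedy heuristic that co-locates
# heavily-communicating rank pairs on the same CMP node.
from collections import defaultdict


def by_rank_placement(nprocs, cores_per_node):
    """Baseline: ranks fill nodes in order (rank // cores_per_node)."""
    return {p: p // cores_per_node for p in range(nprocs)}


def localized_fraction(graph, placement):
    """Fraction of total communication volume that stays intra-node."""
    total = intra = 0
    for (a, b), vol in graph.items():
        total += vol
        if placement[a] == placement[b]:
            intra += vol
    return intra / total if total else 0.0


def greedy_placement(graph, nprocs, cores_per_node):
    """Visit edges in decreasing volume; place each endpoint on (or near)
    its partner's node, overflowing to the next node with free cores."""
    placement = {}
    load = defaultdict(int)

    def assign(p, preferred):
        node = preferred
        while load[node] >= cores_per_node:  # scan forward for a free core
            node += 1
        placement[p] = node
        load[node] += 1

    for (a, b), _vol in sorted(graph.items(), key=lambda kv: -kv[1]):
        if a not in placement and b not in placement:
            assign(a, 0)
            assign(b, placement[a])   # try to co-locate with a
        elif a in placement and b not in placement:
            assign(b, placement[a])
        elif b in placement and a not in placement:
            assign(a, placement[b])
    for p in range(nprocs):           # ranks with no recorded communication
        if p not in placement:
            assign(p, 0)
    return placement


if __name__ == "__main__":
    # Ranks 0<->2 and 1<->3 communicate heavily; by-rank placement on
    # 2-core nodes splits both hot pairs across nodes.
    graph = {(0, 2): 10, (1, 3): 10, (0, 1): 1}
    greedy = greedy_placement(graph, nprocs=4, cores_per_node=2)
    by_rank = by_rank_placement(4, 2)
    print(localized_fraction(graph, greedy))   # ~0.952 (20 of 21 units)
    print(localized_fraction(graph, by_rank))  # ~0.048 (1 of 21 units)
```

The gap between the two fractions mirrors the "communication localization over by-rank scheduling" metric reported in the summary; the real system derives the graph statically rather than assuming it as input.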