Static Approximation of MPI Communication Graphs for Optimized Process Placement


Bibliographic Details
Published in: Languages and Compilers for Parallel Computing, pp. 268–283
Main Authors: McPherson, Andrew J.; Nagarajan, Vijay; Cintra, Marcelo
Format: Book Chapter
Language: English
Published: Cham: Springer International Publishing, 2015
Series: Lecture Notes in Computer Science
ISBN: 331917472X; 9783319174723
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-319-17473-0_18

Summary: Message Passing Interface (MPI) is the de facto standard for programming large-scale parallel programs. Static understanding of MPI programs informs optimizations such as process placement and communication/computation overlap, as well as debugging. In this paper, we present a fully context- and flow-sensitive, interprocedural, best-effort analysis framework to statically analyze MPI programs. We instantiate this to determine an approximation of the point-to-point communication graph of an MPI program. Our analysis is the first pragmatic approach to realizing the full point-to-point communication graph without profiling; indeed, our experiments show that we are able to resolve and understand 100% of the relevant MPI call sites across the NAS Parallel Benchmarks. In all but one case, this only requires specifying the number of processes. To demonstrate an application, we use the analysis to determine process placement on a Chip MultiProcessor (CMP) based cluster. The use of a CMP-based cluster creates a two-tier system, where inter-node communication can be subject to greater latencies than intra-node communication. Intelligent process placement can therefore have a significant impact on the execution time. Using the 64-process versions of the benchmarks, and our analysis, we see an average of 28% (7%) improvement in communication localization over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement.
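The placement application described in the summary can be illustrated with a small sketch: given a weighted point-to-point communication graph (edge weights approximating message volume between MPI ranks), group heavily-communicating processes onto the same node to maximize the fraction of intra-node traffic. This is a minimal, hypothetical greedy heuristic for illustration only, not the paper's actual placement algorithm; `cores_per_node` and the graph representation are assumptions.

```python
# Hypothetical sketch of communication-graph-driven process placement.
# NOT the paper's algorithm: a simple greedy heuristic that co-locates
# heavily-communicating rank pairs on the same CMP node.
from collections import defaultdict


def by_rank_placement(nprocs, cores_per_node):
    """Baseline: ranks fill nodes in order (rank // cores_per_node)."""
    return {p: p // cores_per_node for p in range(nprocs)}


def localized_fraction(graph, placement):
    """Fraction of total communication volume that stays intra-node."""
    total = intra = 0
    for (a, b), vol in graph.items():
        total += vol
        if placement[a] == placement[b]:
            intra += vol
    return intra / total if total else 0.0


def greedy_placement(graph, nprocs, cores_per_node):
    """Visit edges in decreasing volume; place each endpoint on (or near)
    its partner's node, overflowing to the next node with free cores."""
    placement = {}
    load = defaultdict(int)

    def assign(p, preferred):
        node = preferred
        while load[node] >= cores_per_node:  # scan forward for a free core
            node += 1
        placement[p] = node
        load[node] += 1

    for (a, b), _vol in sorted(graph.items(), key=lambda kv: -kv[1]):
        if a not in placement and b not in placement:
            assign(a, 0)
            assign(b, placement[a])   # try to co-locate with a
        elif a in placement and b not in placement:
            assign(b, placement[a])
        elif b in placement and a not in placement:
            assign(a, placement[b])
    for p in range(nprocs):           # ranks with no recorded communication
        if p not in placement:
            assign(p, 0)
    return placement


if __name__ == "__main__":
    # Ranks 0<->2 and 1<->3 communicate heavily; by-rank placement on
    # 2-core nodes splits both hot pairs across nodes.
    graph = {(0, 2): 10, (1, 3): 10, (0, 1): 1}
    greedy = greedy_placement(graph, nprocs=4, cores_per_node=2)
    by_rank = by_rank_placement(4, 2)
    print(localized_fraction(graph, greedy))   # ~0.952 (20 of 21 units)
    print(localized_fraction(graph, by_rank))  # ~0.048 (1 of 21 units)
```

The gap between the two fractions mirrors the "communication localization over by-rank scheduling" metric reported in the summary; the real system derives the graph statically rather than assuming it as input.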