Static Approximation of MPI Communication Graphs for Optimized Process Placement
| Published in | Languages and Compilers for Parallel Computing, pp. 268-283 |
|---|---|
| Format | Book Chapter |
| Language | English |
| Published | Cham: Springer International Publishing, 2015 |
| Series | Lecture Notes in Computer Science |
| ISBN | 331917472X, 9783319174723 |
| ISSN | 0302-9743, 1611-3349 |
| DOI | 10.1007/978-3-319-17473-0_18 |

Summary: Message Passing Interface (MPI) is the de facto standard for programming large-scale parallel programs. Static understanding of MPI programs informs optimizations, including process placement and communication/computation overlap, as well as debugging. In this paper, we present a fully context- and flow-sensitive, interprocedural, best-effort analysis framework to statically analyze MPI programs. We instantiate this framework to determine an approximation of the point-to-point communication graph of an MPI program. Our analysis is the first pragmatic approach to realizing the full point-to-point communication graph without profiling; indeed, our experiments show that we are able to resolve and understand 100% of the relevant MPI call sites across the NAS Parallel Benchmarks. In all but one case, this only requires specifying the number of processes.

To demonstrate an application, we use the analysis to determine process placement on a Chip MultiProcessor (CMP) based cluster. The use of a CMP-based cluster creates a two-tier system, where inter-node communication can be subject to greater latencies than intra-node communication. Intelligent process placement can therefore have a significant impact on the execution time. Using the 64-process versions of the benchmarks and our analysis, we see an average of 28% (7%) improvement in communication localization over by-rank scheduling for 8-core (12-core) CMP-based clusters, representing the maximum possible improvement.