Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters
| Published in | Scientific Programming Vol. 2020; no. 2020; pp. 1-15 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Cairo, Egypt: Hindawi Publishing Corporation, 2020; Hindawi John Wiley & Sons, Inc |
| Subjects | |
| ISSN | 1058-9244, 1875-919X |
| DOI | 10.1155/2020/8862123 |
| Summary: | Graphics processing units (GPUs) offer strong floating-point performance and high memory bandwidth for data-parallel workloads and are widely used in high-performance computing (HPC). The compute unified device architecture (CUDA) provides a parallel computing platform and programming model that reduces the complexity of GPU programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial and temporal discretization, respectively, and turbulence is modeled with the k-ω SST two-equation model. The CPU only manages GPU execution and communication, while the GPU performs the data processing. Parallel execution and memory access optimizations are applied to the GPU-based CFD code. We propose a nonblocking communication method that fully overlaps GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that the multi-GPU parallelization achieves a speedup of more than 36 times over CPU-based parallel computing, and that the parallel algorithm has good scalability. |
|---|---|
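
The central optimization named in the abstract is the overlap of GPU computing, CPU-CPU (MPI) communication, and CPU-GPU data transfer through two CUDA streams over a one-dimensional domain decomposition. The sketch below illustrates that pattern for a single update stage; it is not the authors' solver, and the kernel, halo width, and buffer layout are illustrative assumptions.

```c
/* Minimal sketch (not the authors' code) of the two-stream overlap pattern
 * described in the abstract: boundary cells are updated and exchanged on one
 * stream while the interior update runs on another, hiding the MPI halo
 * exchange and CPU-GPU transfers behind GPU compute. Kernel, halo width, and
 * buffer layout are assumptions. Compile with nvcc + MPI, one rank per GPU. */
#include <mpi.h>
#include <cuda_runtime.h>

/* Stand-in for a real flux/update kernel over cells [first, last). */
__global__ void update_cells(double *q, int first, int last)
{
    int i = first + blockIdx.x * blockDim.x + threadIdx.x;
    if (i < last) q[i] += 1.0;
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20, halo = 4;          /* owned cells per rank, ghost width (assumed) */
    double *d_q, *h_send, *h_recv;
    cudaMalloc(&d_q, (n + 2 * halo) * sizeof(double));
    cudaMallocHost(&h_send, 2 * halo * sizeof(double));   /* pinned buffers so that  */
    cudaMallocHost(&h_recv, 2 * halo * sizeof(double));   /* async copies overlap    */

    cudaStream_t s_bnd, s_int;                /* the two CUDA streams */
    cudaStreamCreate(&s_bnd);
    cudaStreamCreate(&s_int);

    int left  = (rank > 0)        ? rank - 1 : MPI_PROC_NULL;   /* 1D decomposition */
    int right = (rank < size - 1) ? rank + 1 : MPI_PROC_NULL;

    /* 1. Update the boundary slabs first, on the boundary stream. */
    update_cells<<<1, halo, 0, s_bnd>>>(d_q, halo, 2 * halo);   /* left-owned cells  */
    update_cells<<<1, halo, 0, s_bnd>>>(d_q, n, n + halo);      /* right-owned cells */

    /* 2. Launch the (much larger) interior update on the second stream;
     *    it runs concurrently with steps 3-5 below. */
    update_cells<<<(n - 2 * halo + 255) / 256, 256, 0, s_int>>>(d_q, 2 * halo, n);

    /* 3. Stage the boundary slabs into pinned host memory. */
    cudaMemcpyAsync(h_send,        d_q + halo, halo * sizeof(double),
                    cudaMemcpyDeviceToHost, s_bnd);
    cudaMemcpyAsync(h_send + halo, d_q + n,    halo * sizeof(double),
                    cudaMemcpyDeviceToHost, s_bnd);
    cudaStreamSynchronize(s_bnd);             /* boundary data is now on the host */

    /* 4. Nonblocking MPI halo exchange while the interior kernel is still running. */
    MPI_Request req[4];
    MPI_Isend(h_send,        halo, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(h_send + halo, halo, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &req[1]);
    MPI_Irecv(h_recv,        halo, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &req[2]);
    MPI_Irecv(h_recv + halo, halo, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[3]);
    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);

    /* 5. Push the received ghost cells back to the device. */
    cudaMemcpyAsync(d_q,            h_recv,        halo * sizeof(double),
                    cudaMemcpyHostToDevice, s_bnd);
    cudaMemcpyAsync(d_q + n + halo, h_recv + halo, halo * sizeof(double),
                    cudaMemcpyHostToDevice, s_bnd);
    cudaDeviceSynchronize();                  /* both streams finished: stage complete */

    cudaFree(d_q);
    cudaFreeHost(h_send);
    cudaFreeHost(h_recv);
    cudaStreamDestroy(s_bnd);
    cudaStreamDestroy(s_int);
    MPI_Finalize();
    return 0;
}
```

Scheduling the small boundary update on its own stream lets the nonblocking MPI exchange proceed while the large interior kernel keeps the GPU busy; the pinned (page-locked) host buffers are what allow cudaMemcpyAsync to overlap with kernel execution.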