Parallelized GPU Code of City-Level Large Eddy Simulation

Bibliographic Details
Published in: 2020 19th International Symposium on Parallel and Distributed Computing (ISPDC), pp. 76-83
Main Authors: Tsuji, Daisuke; Boku, Taisuke; Ikeda, Ryosaku; Sato, Takuto; Tadano, Hiroto; Kusaka, Hiroyuki
Format: Conference Proceeding
Language: English
Published: IEEE, 01.07.2020
DOI: 10.1109/ISPDC51135.2020.00020

More Information
Summary: In this paper, we describe the GPU implementation of our City-LES code, developed at the Center for Computational Sciences (CCS), University of Tsukuba, for detailed large eddy simulations including surface conditions such as buildings, surface materials, and sunlight effects. We focus on 1) a performance comparison between CUDA and OpenACC, and 2) how to reduce the data exchange between CPU and GPU memories. Using a number of NVIDIA Tesla V100 GPU devices, we found that the current OpenACC compiler by PGI can achieve performance comparable with CUDA in the main part of the LES calculation. We also apply OpenACC aggressively, even to parts whose GPU performance is lower than that of the CPU, in order to avoid data copying between the GPU and CPU and to encapsulate all the data in GPU memory. In our optimized OpenACC code (partially in CUDA), the results show that the performance of the full GPU version is doubled and most of the GPU-CPU data copying is removed, compared with the original GPU code. In the strong-scaling test, the full GPU version achieves 4.7x to 10x the performance of the CPU version on the Cygnus GPU cluster at CCS, where each node is equipped with two Intel Xeon CPUs and four NVIDIA Tesla V100 GPUs, scaling up to 32 nodes with 128 GPUs. For weak scaling, the full GPU version achieves more than 9x the performance of the CPU version for up to 32 nodes with 128 GPUs of parallel execution.
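The key optimization described in the abstract is to keep all data resident in GPU memory, even when some kernels run slower on the GPU than on the CPU, so that no intermediate CPU-GPU copies are needed between kernels. In OpenACC this strategy maps naturally onto a structured data region. The record does not include any source code, so the following is only a minimal C/OpenACC sketch of that idea; the array names, sizes, and kernels are illustrative and are not from City-LES.

```c
#include <stdio.h>
#include <stdlib.h>

#define N      1000000   /* illustrative grid size */
#define NSTEPS 100       /* illustrative number of time steps */

int main(void) {
    float *u = malloc(N * sizeof(float));
    float *v = malloc(N * sizeof(float));
    for (int i = 0; i < N; i++) { u[i] = 1.0f; v[i] = 0.0f; }

    /* One structured data region: arrays are transferred to the GPU
       once before the time loop and back once after it; inside the
       region all kernels work on device-resident data. */
    #pragma acc data copyin(u[0:N]) copy(v[0:N])
    {
        for (int step = 0; step < NSTEPS; step++) {
            /* Main kernel: well suited to the GPU. */
            #pragma acc parallel loop present(u[0:N], v[0:N])
            for (int i = 0; i < N; i++)
                v[i] += 0.5f * u[i];

            /* A small kernel that might be slower on the GPU than on
               the CPU is still executed on the device, so no CPU-GPU
               transfer is needed between the two kernels. */
            #pragma acc parallel loop present(u[0:N], v[0:N])
            for (int i = 0; i < N; i++)
                u[i] = 0.99f * u[i] + 0.01f * v[i];
        }
    }

    printf("v[0] = %f\n", v[0]);
    free(u); free(v);
    return 0;
}
```

Under this pattern, moving a slow kernel back to the CPU would force two transfers per time step across the data-region boundary, which is exactly the cost the paper's full GPU version avoids.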