Transforming the multifluid PPM algorithm to run on GPUs
| Published in | Journal of Parallel and Distributed Computing, Vol. 93-94, no. C, pp. 56-65 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | United States: Elsevier Inc., 01.07.2016 |
| Subjects | |
| ISSN | 0743-7315, 1096-0848 |
| DOI | 10.1016/j.jpdc.2016.04.005 |
| Summary: | In the past several years, there has been much success in adapting numerical algorithms involving linear algebra and pairwise N-body force calculations to run well on GPUs. These numerical algorithms share the feature that high computational intensity can be achieved while holding only small amounts of data in on-chip storage. In previous work, we combined a briquette data structure with a heavily pipelined CFD processing of these briquettes in sequence, which results in a very small on-chip data workspace and high performance for our multifluid PPM gas dynamics algorithm on CPUs with standard-sized caches. The on-chip data workspace produced in that earlier work is not small enough to meet the requirements of today's GPUs, which demand that no more than 32 kB of on-chip data be associated with a single thread of control (a warp). Here we report a variant of our earlier technique that allows a user-controllable trade-off between workspace size and redundant computation, which can be a win on GPUs. We use our multifluid PPM gas dynamics algorithm to illustrate this technique. Performance results for this algorithm in 32-bit precision on a recently introduced dual-chip GPU, the Nvidia K80, are 1.7 times those on a similarly recent dual-CPU node using two 16-core Intel Haswell chips. The redundant computation that allows the on-chip data context for each thread of control to be less than 32 kB is roughly 9% of the total. We have built an automatic translator from a Fortran expression of the algorithm to CUDA to ease the programming burden involved in applying our technique. |
|---|---|

Highlights:

- An optimization for the limited on-chip workspace of GPUs.
- A user-controllable trade-off between workspace size and redundant computation.
- Automatic translators that automate the optimizations.
- Speedups of 1.7 to 2.4 times over comparable CPU systems.
- Performance superior or comparable to other CFD codes running on GPUs.
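The trade-off described in the summary, a small amount of redundant computation in exchange for a much smaller on-chip workspace, can be illustrated with a toy 1-D stencil. The sketch below is ours, not the authors' code: each warp-sized thread block owns one briquette of cells, loads a small halo of extra inputs, and redundantly recomputes the intermediate values in that halo rather than keeping a larger shared workspace. All names here (`BRIQ`, `HALO`, `flux`, `briquette_update`) are illustrative assumptions; the actual multifluid PPM pipeline is far more elaborate.

```cuda
// Minimal sketch (not the paper's code): each warp-sized thread block updates
// one "briquette" of BRIQ cells. Instead of reading neighbors' intermediate
// results from a large workspace, the block loads HALO extra input cells on
// each side and recomputes the intermediates there redundantly, so the
// per-warp shared-memory workspace stays small (well under 32 kB).
#include <cstdio>

#define BRIQ 32   // cells owned by one thread block (one warp)
#define HALO 2    // redundantly recomputed ghost cells per side

__device__ float flux(float a, float b) {          // toy intermediate quantity
    return 0.5f * (a + b);
}

__global__ void briquette_update(const float* in, float* out, int n) {
    // Workspace: inputs and intermediates for one briquette plus halos only.
    __shared__ float u[BRIQ + 2 * HALO];
    __shared__ float f[BRIQ + 2 * HALO];           // f[j] = flux(u[j-1], u[j])

    int base = blockIdx.x * BRIQ;                  // first cell of this briquette
    int i    = threadIdx.x;                        // 0 .. BRIQ-1

    // Cooperative load of the briquette plus halo, clamped at domain edges.
    for (int j = i; j < BRIQ + 2 * HALO; j += blockDim.x) {
        int g = max(0, min(n - 1, base + j - HALO));
        u[j] = in[g];
    }
    __syncthreads();

    // Redundant work: fluxes in the halo are recomputed by every block that
    // borders them, instead of being stored in a larger shared workspace.
    for (int j = i + 1; j < BRIQ + 2 * HALO; j += blockDim.x) {
        f[j] = flux(u[j - 1], u[j]);
    }
    __syncthreads();

    // Interior update uses only this block's small workspace.
    int g = base + i;
    if (g < n) out[g] = u[i + HALO] - (f[i + HALO + 1] - f[i + HALO]);
}

int main() {
    const int n = 4 * BRIQ;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    briquette_update<<<n / BRIQ, BRIQ>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[40] = %f\n", h_out[40]);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

In this toy, the halo fluxes are computed by more than one block; that duplicated work is the price paid for the shrunken workspace. The summary reports that the analogous overhead in the full multifluid PPM algorithm is roughly 9% of the total computation.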
| Bibliography: | USDOE 237111; 1254431; AC52-07NA27344; LLNL-JRNL-673849 |