Transforming the multifluid PPM algorithm to run on GPUs
| Published in | Journal of Parallel and Distributed Computing, Vol. 93-94, no. C, pp. 56-65 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | United States: Elsevier Inc., 01.07.2016 |
| Subjects | |
| ISSN | 0743-7315, 1096-0848 |
| DOI | 10.1016/j.jpdc.2016.04.005 |
| Summary: | In the past several years, there has been much success in adapting numerical algorithms involving linear algebra and pairwise N-body force calculations to run well on GPUs. These numerical algorithms share the feature that high computational intensity can be achieved while holding only small amounts of data in on-chip storage. In previous work, we combined a briquette data structure with a heavily pipelined CFD processing of these briquettes in sequence, which results in a very small on-chip data workspace and high performance for our multifluid PPM gas dynamics algorithm on CPUs with standard-sized caches. The on-chip data workspace produced in that earlier work is not small enough to meet the requirements of today's GPUs, which demand that no more than 32 kB of on-chip data be associated with a single thread of control (a warp). Here we report a variant of our earlier technique that allows a user-controllable trade-off between workspace size and redundant computation, which can be a win on GPUs. We use our multifluid PPM gas dynamics algorithm to illustrate this technique. Performance results for this algorithm in 32-bit precision on a recently introduced dual-chip GPU, the Nvidia K80, are 1.7 times those on a similarly recent dual-CPU node using two 16-core Intel Haswell chips. The redundant computation that allows the on-chip data context for each thread of control to be less than 32 kB is roughly 9% of the total. We have built an automatic translator from a Fortran expression of the algorithm to CUDA to ease the programming burden involved in applying our technique. |
|---|---|

Highlights:

- An optimization for the limited on-chip workspace of GPUs.
- A user-controllable trade-off between workspace size and redundant computation.
- Automatic translators that automate the optimizations.
- Speedups of 1.7 to 2.4 times over comparable CPU systems.
- Performance superior or comparable to other CFD codes running on GPUs.
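The trade-off described in the summary, a small amount of redundant computation in exchange for a much smaller on-chip workspace, can be illustrated with a toy 1-D stencil. The sketch below is ours, not the authors' code: each warp-sized thread block owns one briquette of cells, loads a small halo of extra inputs, and redundantly recomputes the intermediate values in that halo rather than keeping a larger shared workspace. All names here (`BRIQ`, `HALO`, `flux`, `briquette_update`) are illustrative assumptions; the actual multifluid PPM pipeline is far more elaborate.

```cuda
// Minimal sketch (not the paper's code): each warp-sized thread block updates
// one "briquette" of BRIQ cells. Instead of reading neighbors' intermediate
// results from a large workspace, the block loads HALO extra input cells on
// each side and recomputes the intermediates there redundantly, so the
// per-warp shared-memory workspace stays small (well under 32 kB).
#include <cstdio>

#define BRIQ 32   // cells owned by one thread block (one warp)
#define HALO 2    // redundantly recomputed ghost cells per side

__device__ float flux(float a, float b) {          // toy intermediate quantity
    return 0.5f * (a + b);
}

__global__ void briquette_update(const float* in, float* out, int n) {
    // Workspace: inputs and intermediates for one briquette plus halos only.
    __shared__ float u[BRIQ + 2 * HALO];
    __shared__ float f[BRIQ + 2 * HALO];           // f[j] = flux(u[j-1], u[j])

    int base = blockIdx.x * BRIQ;                  // first cell of this briquette
    int i    = threadIdx.x;                        // 0 .. BRIQ-1

    // Cooperative load of the briquette plus halo, clamped at domain edges.
    for (int j = i; j < BRIQ + 2 * HALO; j += blockDim.x) {
        int g = max(0, min(n - 1, base + j - HALO));
        u[j] = in[g];
    }
    __syncthreads();

    // Redundant work: fluxes in the halo are recomputed by every block that
    // borders them, instead of being stored in a larger shared workspace.
    for (int j = i + 1; j < BRIQ + 2 * HALO; j += blockDim.x) {
        f[j] = flux(u[j - 1], u[j]);
    }
    __syncthreads();

    // Interior update uses only this block's small workspace.
    int g = base + i;
    if (g < n) out[g] = u[i + HALO] - (f[i + HALO + 1] - f[i + HALO]);
}

int main() {
    const int n = 4 * BRIQ;
    float h_in[n], h_out[n];
    for (int i = 0; i < n; ++i) h_in[i] = (float)i;

    float *d_in, *d_out;
    cudaMalloc(&d_in,  n * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    briquette_update<<<n / BRIQ, BRIQ>>>(d_in, d_out, n);
    cudaMemcpy(h_out, d_out, n * sizeof(float), cudaMemcpyDeviceToHost);

    printf("out[40] = %f\n", h_out[40]);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
```

In this toy, the halo fluxes are computed by more than one block; that duplicated work is the price paid for the shrunken workspace. The summary reports that the analogous overhead in the full multifluid PPM algorithm is roughly 9% of the total computation.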
| Bibliography: | USDOE 237111; 1254431; AC52-07NA27344; LLNL-JRNL-673849 |