NAS Parallel Benchmarks with Python: a performance and programming effort analysis focusing on GPUs

Bibliographic Details
Published in: The Journal of Supercomputing, Vol. 79, No. 8, pp. 8890-8911
Main Authors: Di Domenico, Daniel; Lima, João V. F.; Cavalheiro, Gerson G. H.
Format: Journal Article
Language: English
Published: New York: Springer US, 01.05.2023 (Springer Nature B.V.)
ISSN: 0920-8542, 1573-0484
DOI: 10.1007/s11227-022-04932-3


More Information
Summary: Compiled low-level languages, such as C/C++ and Fortran, have been the usual programming tools for implementing applications that exploit GPU devices. As a counterpoint to that trend, this paper presents a performance and programming effort analysis with Python, an interpreted, high-level language, which was used to develop the kernels and applications of the NAS Parallel Benchmarks (NPB) targeting GPUs. We used the Numba environment to enable CUDA support in Python, a tool that allows GPU programs to be written in pure Python code. Our experimental results showed that the Python applications reached performance similar to C++ programs using CUDA, and better than C++ using OpenACC, for most NPB benchmarks. Furthermore, the Python codes demanded fewer GPU-framework-related operations than CUDA, mainly because Python needs fewer statements to manage memory allocations and data transfers. Even so, our Python implementations required more such operations than the OpenACC ones.