Auto-tuning 3-D FFT library for CUDA GPUs

Bibliographic Details
Published in	Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1-10
Main Authors	Nukada, Akira; Matsuoka, Satoshi
Format	Conference Proceeding
Language	English
Published	New York, NY, USA: ACM, 14 November 2009
Series	ACM Conferences
Subjects	Bandwidth; Codes; Computing methodologies > Computer graphics > Graphics systems and interfaces > Graphics processors; Graphics processing units; Hardware; Instruction sets; Kernel; Mathematics of computing > Mathematical analysis > Functional analysis > Approximation; Mathematics of computing > Mathematical analysis > Numerical analysis > Computation of transforms; Mathematics of computing > Mathematical software; Programming; Registers; Theory of computation > Design and analysis of algorithms > Approximation algorithms analysis; Three-dimensional displays; Transforms
Online Access	Get full text
ISBN	1605587443 9781605587448
ISSN	2167-4329
DOI	10.1145/1654059.1654090

Summary:	Existing implementations of FFTs on GPUs are optimized for specific transform sizes, such as powers of two, and exhibit unstable, peaky performance; that is, they do not perform as well at other sizes that appear in practice. Our new auto-tuning 3-D FFT on CUDA alleviates this problem by generating high-performance CUDA kernels for FFTs of varying transform sizes. Although auto-tuning has been implemented on GPUs for dense kernels such as DGEMM and stencils, this is the first time it has been applied comprehensively to bandwidth-intensive, complex kernels such as 3-D FFTs. Bandwidth-intensive optimizations, such as selecting the number of threads and inserting padding to avoid bank conflicts in shared memory, are applied systematically. The resulting autotuner is fast, yields performance that essentially beats all 3-D FFT implementations on a single processor to date, and exhibits stable performance irrespective of problem size or the underlying GPU hardware.