A massively parallel adaptive fast-multipole method on heterogeneous architectures
| Published in | Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), pp. 1-12 |
|---|---|
| Main Authors | , , , , , , , , , |
| Format | Conference Proceeding |
| Language | English |
| Published | New York, NY, USA: ACM, 14.11.2009 |
| Series | ACM Conferences |
| Subjects | Computing methodologies > Modeling and simulation > Model development and analysis > Model verification and validation; Computing methodologies > Modeling and simulation > Model development and analysis > Modeling methodologies |
| ISBN | 1605587443, 9781605587448 |
| ISSN | 2167-4329 |
| DOI | 10.1145/1654059.1654118 |
| Summary: | We present new scalable algorithms and a new implementation of our kernel-independent fast multipole method (Ying et al. ACM/IEEE SC '03), in which we employ both distributed memory parallelism (via MPI) and shared memory/streaming parallelism (via GPU acceleration) to rapidly evaluate two-body non-oscillatory potentials. On traditional CPU-only systems, our implementation scales well up to 30 billion unknowns on 65K cores (AMD/CRAY-based Kraken system at NSF/NICS) for highly non-uniform point distributions. On GPU-enabled systems, we achieve a 30x speedup for problems of up to 256 million points on 256 GPUs (Lincoln at NSF/NCSA) over comparable CPU-only implementations.
We achieve scalability to such extreme core counts by adopting a new approach to scalable MPI-based tree construction and partitioning, and a new reduction algorithm for the evaluation phase. For the sub-components of the evaluation phase (the direct- and approximate-interactions, the target evaluation, and the source-to-multipole translations), we use NVIDIA's CUDA framework for GPU acceleration to achieve excellent performance. To do so requires carefully constructed data structure transformations, which we describe in the paper and whose cost we show is minor. Taken together, these components show promise for ultrascalable FMM in the petascale era and beyond. | 
|---|---|
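The summary distinguishes "direct interactions" (exact pairwise evaluation between nearby points) from the approximate multipole interactions that make the FMM fast. As a minimal illustration of the direct part only, here is a sketch of evaluating the non-oscillatory 1/r (Laplace) potential with a brute-force double sum; this is the O(NM) computation that the FMM replaces with multipole approximations for well-separated points, and it is not code from the paper's kernel-independent implementation.

```python
import numpy as np

def direct_potential(sources, charges, targets, eps=1e-12):
    """Directly evaluate the 1/r (Laplace) potential at each target point.

    This brute-force O(N*M) sum is the "direct interaction" an FMM keeps
    only for near neighbors; far-field contributions are approximated by
    multipole expansions instead. (Illustrative sketch, not the paper's code.)
    """
    # Pairwise displacement vectors: shape (M targets, N sources, 3)
    diff = targets[:, None, :] - sources[None, :, :]
    r = np.sqrt((diff ** 2).sum(axis=-1))
    # Guard against zero distance (self-interaction)
    r = np.where(r < eps, np.inf, r)
    return (charges[None, :] / r).sum(axis=1)

# Example: potential at the origin due to two unit charges at unit distance
src = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
q = np.array([1.0, 1.0])
tgt = np.array([[0.0, 0.0, 0.0]])
print(direct_potential(src, q, tgt))  # → [2.]
```

The vectorized broadcast form mirrors how the direct-interaction sub-component maps naturally onto GPUs: every (target, source) pair is independent, which is why the paper can accelerate this phase with CUDA.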