Distributed and heterogeneous tensor–vector contraction algorithms for high performance computing

Bibliographic Details
Published in: Future Generation Computer Systems, Vol. 166, p. 107698
Main Authors: Martinez-Ferrer, Pedro J.; Yzelman, Albert-Jan; Beltran, Vicenç
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.05.2025
ISSN: 0167-739X
DOI: 10.1016/j.future.2024.107698

More Information
Summary: The tensor–vector contraction (TVC) is the most memory-bound operation of its class and a core component of the higher-order power method (HOPM). This paper brings distributed-memory parallelization to a native TVC algorithm for dense tensors that remains oblivious to contraction mode, tensor splitting, and tensor order. Building on it, we propose a novel distributed HOPM, dHOPM3, which can save up to one order of magnitude of streamed memory while costing only about twice as much data movement as a single distributed TVC operation (dTVC) when using task-based parallelization. The numerical experiments carried out in this work on three different architectures featuring multicore processors and accelerators confirm that the performance of dTVC and dHOPM3 remains relatively close to the peak system memory bandwidth (50%–80%, depending on the architecture) and on par with STREAM benchmark figures. In strong scalability scenarios, our native multicore implementations of these two algorithms achieve similar, and sometimes greater, performance than implementations based on state-of-the-art CUDA batched kernels. Finally, we demonstrate that both computation and communication can benefit from mixed-precision arithmetic, even when the hardware does not support low-precision data types natively.

Highlights:
• A novel distributed tensor–vector contraction algorithm is introduced.
• Analytical formulae are derived for minimal data movement and best throughput.
• A best-in-class higher-order power method with task-based parallelization is given.
• Our dTVC and dHOPM3 algorithms reach 50% to 80% of the theoretical peak bandwidth.
• Our cached, mixed-precision ad-hoc kernels speed up computation and communication.
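For context, a standard textbook definition of the mode-n TVC (consistent with, though not quoted from, this paper): contracting an order-$d$ tensor $\mathcal{X} \in \mathbb{R}^{I_1 \times \cdots \times I_d}$ with a vector $\mathbf{v} \in \mathbb{R}^{I_n}$ along mode $n$ yields the order-$(d-1)$ tensor

  $(\mathcal{X} \times_n \mathbf{v})_{i_1 \cdots i_{n-1}\, i_{n+1} \cdots i_d} = \sum_{i_n=1}^{I_n} x_{i_1 \cdots i_d}\, v_{i_n}.$

Every element of $\mathcal{X}$ is read exactly once and used in a single multiply-add, so the arithmetic intensity is low and throughput is bounded by memory bandwidth, which is why the summary compares against STREAM figures. The HOPM for a rank-1 approximation cycles over the modes, contracting $\mathcal{X}$ along every mode except the current one and normalizing the resulting vector.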
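As a minimal single-node sketch of these two building blocks (not the authors' distributed, task-based dTVC/dHOPM3 implementation; the function names tvc and hopm are illustrative), both can be written with NumPy:

import numpy as np

def tvc(X, v, mode):
    # Mode-`mode` tensor-vector contraction: sum X against v along one axis;
    # the result has order X.ndim - 1.
    return np.tensordot(X, v, axes=([mode], [0]))

def hopm(X, iters=100, seed=0):
    # Higher-order power method for the dominant rank-1 approximation
    # lam * u_0 (x) u_1 (x) ... of X: contract along all modes but one,
    # normalize, and cycle through the modes until the factors settle.
    rng = np.random.default_rng(seed)
    us = [rng.standard_normal(n) for n in X.shape]
    us = [u / np.linalg.norm(u) for u in us]
    lam = 0.0
    for _ in range(iters):
        for k in range(X.ndim):
            w = X
            # Contract highest modes first so the remaining axis indices
            # stay valid after each contraction removes one axis.
            for m in sorted((m for m in range(X.ndim) if m != k), reverse=True):
                w = tvc(w, us[m], m)
            lam = np.linalg.norm(w)   # w is now a vector along mode k
            us[k] = w / lam
    return lam, us

# Quick check on a rank-1 test tensor: hopm should recover |a|*|b|*|c|.
a, b, c = (np.arange(1.0, n + 1) for n in (3, 4, 5))
X = np.einsum('i,j,k->ijk', a, b, c)
lam, us = hopm(X)
print(lam, np.linalg.norm(a) * np.linalg.norm(b) * np.linalg.norm(c))

Note that a naive sweep like this re-streams the full tensor once per mode; the streamed-memory saving quoted for dHOPM3 in the summary indicates that it avoids much of that redundant traffic.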