Fusion of parallel array operations

Bibliographic Details
Published in	2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) pp. 71 - 85
Main Authors	Kristensen, Mads R. B., Lund, Simon A. F., Blum, Troels, Avery, James
Format	Conference Proceeding
Language	English
Published	ACM 01.09.2016
Subjects	Approximation algorithms; Arrays; Cost function; Fuses; Law
DOI	10.1145/2967938.2967945

Summary:	We address the problem of fusing array operations based on criteria such as shape compatibility, data reuse, and minimizing communication. We formulate the problem as a partitioning problem (WSP) that is general enough to handle loop fusion, combinator fusion, and other types of fusion analysis. Traditionally, when optimizing for data reuse, the fusion problem has been formulated as a static weighted graph partitioning problem (known as the Weighted Loop Fusion problem). We show that this scheme cannot accurately track data reuse between multiple independent loops, since it overestimates total data reuse in certain cases. Our formulation in terms of partitions allows the use of realistic cost functions that can track resource usage accurately. We give correctness proofs, and prove that WSP can maximize data reuse in programs exactly, in contrast to prior work. For the exact optimal solution, which is NP-hard to find, we present a branch-and-bound algorithm together with a polynomial-time preconditioner that reduces the problem size significantly in practice. We further present a polynomial-time greedy approximation that is fast enough to use for JIT compilation and gives near-optimal results in practice. All algorithms have been implemented in the automatic parallelization platform Bohrium, run on a set of benchmarks, and compared to existing methods from the literature.
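
The summary's claim that static edge weights can overestimate data reuse may be easier to see on a toy case. The sketch below is only an illustration under stated assumptions, not the paper's formulation or Bohrium code: the three operations, the `reads` table, unit-sized arrays, and both function names are hypothetical. It compares a pairwise edge-weight sum with a cost computed on the fused block as a whole.

```python
from itertools import combinations

# Hypothetical toy program: three independent operations that all read array "A".
# Every array is assumed to have unit size.
reads = {
    "op1": {"A"},
    "op2": {"A"},
    "op3": {"A"},
}

def pairwise_edge_savings(block):
    """Static edge-weight estimate: sum the shared arrays over every pair in the block."""
    return sum(len(reads[u] & reads[v]) for u, v in combinations(sorted(block), 2))

def partition_savings(block):
    """Block-level estimate: array reads without fusion minus distinct arrays read once fused."""
    unfused = sum(len(reads[op]) for op in block)
    fused = len(set().union(*(reads[op] for op in block)))
    return unfused - fused

block = {"op1", "op2", "op3"}
print(pairwise_edge_savings(block))  # sums three pairwise edges
print(partition_savings(block))      # three reads of A collapse into one
```

In this toy case the pairwise sum counts three saved reads, while fusing the three operations can save at most two, since one read of `A` is still required. A cost function evaluated on the partition block sees the whole block and avoids the double counting.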