Minimizing communication in sparse matrix solvers

Bibliographic Details
Published in: Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, pp. 1 - 12
Main Authors: Mohiyuddin, Marghoob; Hoemmen, Mark; Demmel, James; Yelick, Katherine
Format: Conference Proceeding
Language: English
Published: New York, NY, USA: ACM, 14.11.2009
Series: ACM Conferences
ISBN: 1605587443, 9781605587448
ISSN: 2167-4329
DOI: 10.1145/1654059.1654096

Summary: Data communication within the memory system of a single processor node and between multiple nodes in a system is the bottleneck in many iterative sparse matrix solvers like CG and GMRES. Here k iterations of a conventional implementation perform k sparse matrix-vector multiplications and Ω(k) vector operations like dot products, resulting in communication that grows by a factor of Ω(k) in both the memory system and the network. By reorganizing the sparse-matrix kernel to compute a set of matrix-vector products at once, and reorganizing the rest of the algorithm accordingly, we can perform k iterations by sending O(log P) messages instead of O(k · log P) messages on a parallel machine, and by reading the matrix A from DRAM to cache just once, instead of k times, on a sequential machine. This reduces communication to the minimum possible. We combine these techniques to form a new variant of GMRES. Our shared-memory implementation on an 8-core Intel Clovertown achieves speedups of up to 4.3x over standard GMRES, without sacrificing convergence rate or numerical stability.
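
The reorganized kernel the summary describes is often called a matrix powers kernel in the communication-avoiding literature: it produces the Krylov basis [x, Ax, A^2 x, ..., A^k x] in one pass over A rather than k independent passes. The Python/SciPy sketch below shows only that kernel's mathematical contract, not the authors' optimized implementation; the function name matrix_powers and the toy tridiagonal test matrix are illustrative assumptions, and the naive loop omits the tiling and ghost-zone exchange that let the paper's version read A from DRAM once and send O(log P) messages.

    # Naive reference for a matrix powers kernel: returns the Krylov
    # basis [x, A x, ..., A^k x] as the columns of V. A tuned kernel
    # would block this loop so A streams through cache only once; this
    # sketch shows the contract, not the optimized schedule.
    import numpy as np
    import scipy.sparse as sp

    def matrix_powers(A, x, k):
        n = x.shape[0]
        V = np.empty((n, k + 1))
        V[:, 0] = x
        for j in range(k):
            V[:, j + 1] = A @ V[:, j]   # one SpMV per basis vector
        return V

    # Toy usage on a 1-D Poisson (tridiagonal) matrix; sizes are arbitrary.
    n, k = 1000, 4
    A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
    V = matrix_powers(A, np.ones(n), k)

Given such a basis, the Ω(k) individual dot products of standard GMRES can be replaced by a single block orthogonalization of V (e.g., a tall-skinny QR), which is how a k-step variant also reduces the vector-operation traffic.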