A Tight I/O Lower Bound for Matrix Multiplication

A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C := AB$, for distinct matrices $A$, $B$, and $C$,...

Bibliographic Details
Main Authors	Smith, Tyler Michael; Lowery, Bradley; Langou, Julien; van de Geijn, Robert A
Format	Journal Article
Language	English
Published	2017-02-03
Subjects	Computer Science - Computational Complexity
DOI	10.48550/arxiv.1702.02017


Abstract	A tight lower bound for required I/O when computing an ordinary matrix-matrix multiplication (MMM) on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C := AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory, and $M$ is the size of fast memory. A lower bound on the number of segments was then determined by obtaining an upper bound on the number of elementary multiplications performed per segment. This paper follows the same high-level approach, but improves the lower bound by (1) transforming algorithms for MMM so that they perform all computation via fused multiply-add instructions (FMAs) and using this to reason about only the cost associated with reading the matrices, and (2) decoupling the per-segment I/O cost from the size of fast memory. For $n \times n$ matrices, the lower bound's leading-order term is $2n^3/\sqrt{M}$. A theoretical algorithm whose leading term attains this is introduced. To what extent the state-of-the-art Goto's Algorithm attains the lower bound is discussed.
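To give a feel for the size of the leading-order term $2n^3/\sqrt{M}$ quoted in the abstract, the following minimal Python sketch evaluates it and compares it with the read count of a textbook blocking scheme. The scheme is an assumption made for illustration and is not claimed to be the paper's theoretical algorithm or Goto's Algorithm: a $b \times b$ tile of $C$ stays resident in fast memory while a length-$b$ slice of a column of $A$ and a length-$b$ slice of a row of $B$ are streamed through, so $b$ is chosen with $b^2 + 2b \le M$. Only reads of $A$ and $B$ are counted, and the function names are hypothetical.

```python
import math


def io_lower_bound_leading_term(n: int, M: int) -> float:
    """Leading-order term 2 n^3 / sqrt(M) quoted in the abstract.

    Lower-order terms of the actual bound are not modeled here.
    """
    return 2 * n**3 / math.sqrt(M)


def resident_tile_reads(n: int, M: int) -> int:
    """Reads of A and B for an illustrative resident-C-tile scheme.

    Assumption (not taken from the paper): fast memory holds one b x b
    tile of C plus b elements of A and b elements of B at a time.
    """
    if M < 3:
        raise ValueError("need M >= 3 so that a 1 x 1 tile fits")
    # Largest b with b*b + 2*b <= M, i.e. (b + 1)^2 <= M + 1.
    b = math.isqrt(M + 1) - 1
    tiles_per_dim = math.ceil(n / b)
    # Each C tile with r rows and c columns reads r*n elements of A and
    # c*n elements of B; summing over all tiles gives 2 * n^2 * tiles_per_dim.
    return 2 * n * n * tiles_per_dim


if __name__ == "__main__":
    n, M = 64, 80
    bound = io_lower_bound_leading_term(n, M)
    reads = resident_tile_reads(n, M)
    print(f"leading term: {bound:.1f}, illustrative scheme reads: {reads}")
    print(f"ratio: {reads / bound:.3f}")
```

Under these assumptions the scheme reads roughly $2n^3/b$ elements with $b$ slightly below $\sqrt{M}$, so the ratio to the leading term approaches 1 as $M$ grows.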
OpenAccessLink	https://arxiv.org/abs/1702.02017