A Tight I/O Lower Bound for Matrix Multiplication

Bibliographic Details
Main Authors Smith, Tyler Michael; Lowery, Bradley; Langou, Julien; van de Geijn, Robert A
Format Journal Article
Language English
Published 2017-02-03
DOI 10.48550/arxiv.1702.02017

Abstract A tight lower bound on the I/O required to compute an ordinary matrix-matrix multiplication (MMM) on a processor with two layers of memory is established. Prior work obtained weaker lower bounds by reasoning about the number of segments needed to perform $C:=AB$, for distinct matrices $A$, $B$, and $C$, where each segment is a series of operations involving $M$ reads and writes to and from fast memory, and $M$ is the size of fast memory. A lower bound on the number of segments was then determined by obtaining an upper bound on the number of elementary multiplications performed per segment. This paper follows the same high-level approach, but improves the lower bound by (1) transforming algorithms for MMM so that they perform all computation via fused multiply-add instructions (FMAs) and using this to reason about only the cost associated with reading the matrices, and (2) decoupling the per-segment I/O cost from the size of fast memory. For $n \times n$ matrices, the lower bound's leading-order term is $2n^3/\sqrt{M}$. A theoretical algorithm whose leading-order term attains this bound is introduced. The extent to which the state-of-the-art Goto's Algorithm attains the lower bound is discussed.
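To make the leading-order term $2n^3/\sqrt{M}$ concrete, the following Python sketch evaluates it and compares it with the read count of a simple tiling that keeps one $b \times b$ block of $C$ resident in fast memory while streaming a column slice of $A$ and a row slice of $B$ per rank-1 update. This tiling is an illustrative assumption, not necessarily the theoretical algorithm introduced in the paper, and all function names are hypothetical. Only reads of $A$ and $B$ are counted; traffic for $C$ is of lower order ($n^2$) and is ignored here.

```python
import math


def io_lower_bound_leading_term(n: int, M: int) -> float:
    """Leading-order term 2*n^3/sqrt(M) quoted in the abstract."""
    return 2.0 * n**3 / math.sqrt(M)


def largest_tile(M: int) -> int:
    """Largest b with b*b + 2*b <= M (assumed layout: one b x b block
    of C plus a length-b slice of A and a length-b slice of B)."""
    # b*b + 2*b <= M  is equivalent to  (b + 1)^2 <= M + 1
    b = math.isqrt(M + 1) - 1
    if b < 1:
        raise ValueError("fast memory too small for a 1 x 1 tile")
    return b


def resident_c_tiling_reads(n: int, M: int) -> int:
    """Reads of A and B for the illustrative resident-C tiling.

    Each tile of C with bi rows and bj columns needs n rank-1 updates,
    each reading bi elements of A and bj elements of B. Summing over
    all tiles gives 2 * n^2 * ceil(n / b).
    """
    b = largest_tile(M)
    tiles_per_dim = math.ceil(n / b)
    return 2 * n * n * tiles_per_dim


if __name__ == "__main__":
    n, M = 1024, 1088  # hypothetical problem size and fast-memory size
    bound = io_lower_bound_leading_term(n, M)
    reads = resident_c_tiling_reads(n, M)
    print(f"tile size b         : {largest_tile(M)}")
    print(f"leading-order bound : {bound:.0f}")
    print(f"tiling reads (A, B) : {reads}")
    print(f"ratio reads / bound : {reads / bound:.4f}")
```

Because $b$ is slightly smaller than $\sqrt{M}$, the tiling's read count $2n^3/b$ sits just above the leading-order term, and the gap narrows as $M$ grows.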
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
SecondaryResourceType preprint
SubjectTerms Computer Science - Computational Complexity
URI https://arxiv.org/abs/1702.02017