Memory Dependence Speculation for Simultaneous Multi-Threading Processors

Bibliographic Details
Published in: Parallel Processing Letters, Vol. 34, no. 2
Main Authors: Flores, Jonathan; Lin, Wei-Ming
Format: Journal Article
Language: English
Published: Singapore: World Scientific Publishing Company, 01.06.2024
World Scientific Publishing Co. Pte., Ltd
ISSN: 0129-6264, 1793-642X
DOI: 10.1142/S0129626424500014

Summary: Simultaneous Multi-Threading (SMT) processors improve on the traditional out-of-order superscalar architecture by allowing instructions from several independent threads to execute out of order concurrently. Maintaining the correctness of values read from and written to memory is a major bottleneck for processor performance, because loads must either stall until all prior store addresses are known or risk reading invalid data. Prior research in this area has mainly focused on superscalar architectures, so it is natural to extend memory dependence speculation techniques to an SMT architecture. In this paper, we allow loads in all threads to execute as soon as their addresses are resolved, without checking for prior memory address conflicts. Each store also checks all later loads to see whether any of them read too early, as indicated by an address match; if so, the processor state is recovered and the load is re-issued. This aggressive technique offers the greatest potential instructions-per-cycle (IPC) gains over predictive techniques, as the pipeline is never stalled for loads. Our simulations show that overall IPC gains of up to 12% and 10% are possible for 4-threaded and 8-threaded workloads, respectively. Conversely, maximum overall IPC losses of at least 2.3% and 2% were also observed for 4-threaded and 8-threaded workloads, respectively.
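
The mechanism outlined in the summary (issue loads without waiting on older stores, then have each store check later loads for an address match) can be illustrated with a minimal Python sketch. This is not the authors' simulator: the names `Load`, `ThreadLSQ`, `issue_load`, and `resolve_store` are hypothetical, and the sketch assumes that each thread keeps its own queue in program order and that "recovery" can be modeled as a simple replay counter.

```python
from dataclasses import dataclass


@dataclass
class Load:
    seq: int          # program-order sequence number within one thread (assumed)
    addr: int         # resolved memory address
    replays: int = 0  # stand-in for "recover state and re-issue the load"


class ThreadLSQ:
    """Hypothetical per-thread record of speculatively issued loads."""

    def __init__(self):
        self.issued_loads = []

    def issue_load(self, seq, addr):
        # Speculative issue: the load runs as soon as its address is known,
        # with no check against older stores whose addresses are unresolved.
        load = Load(seq, addr)
        self.issued_loads.append(load)
        return load

    def resolve_store(self, seq, addr):
        # Once the store address is known, find later (younger) loads that
        # already read the same address; those loads read too early.
        violators = [ld for ld in self.issued_loads
                     if ld.seq > seq and ld.addr == addr]
        for ld in violators:
            ld.replays += 1  # model of recovery followed by re-issue
        return violators


lsq = ThreadLSQ()
older = lsq.issue_load(seq=2, addr=0x100)      # older than the store: safe
early = lsq.issue_load(seq=5, addr=0x100)      # younger, same address: violation
unrelated = lsq.issue_load(seq=7, addr=0x200)  # younger, different address: safe
violations = lsq.resolve_store(seq=3, addr=0x100)
```

In this sketch only the load with sequence number 5 is flagged, since it is later than the store in program order and touches the same address. A real design would also have to squash instructions that consumed the mis-speculated value, which is omitted here.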