Memory Dependence Speculation for Simultaneous Multi-Threading Processors

Bibliographic Details
Published in	Parallel processing letters Vol. 34; no. 2
Main Authors	Flores, Jonathan; Lin, Wei-Ming
Format	Journal Article
Language	English
Published	Singapore: World Scientific Publishing Company, 01.06.2024
Subjects	Computer architecture; Microprocessors; Processors; Read-only memory devices; Workload; Workloads; speculation; Multi-threading; SMT
ISSN	0129-6264 1793-642X
DOI	10.1142/S0129626424500014

Summary:	Simultaneous Multi-Threading (SMT) processors improve on the traditional out-of-order superscalar architecture by allowing instructions from several independent threads to execute out of order concurrently. Maintaining the correctness of values read from and written to memory is a major bottleneck for processor performance: a load must either stall until all prior store addresses are known or risk reading invalid data. Prior research in this area has mainly focused on superscalar architectures, so it is natural to extend memory dependence speculation techniques to an SMT architecture. In this paper, we allow loads in all threads to execute as soon as their addresses are resolved, without checking for conflicts with prior memory addresses. Each store then checks all later loads to see whether any of them read too early, as indicated by an address match; if so, the processor state is recovered and the load is re-issued. This aggressive technique offers the greatest potential instructions-per-clock-cycle (IPC) gains over predictive techniques, as the pipeline is never stalled for loads. Our simulations show that overall IPC gains of up to 12% and 10% are possible for 4-threaded and 8-threaded workloads, respectively. Conversely, a maximum overall IPC loss of at least 2.3% and 2% was also observed for 4-threaded and 8-threaded workloads, respectively.
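
The load/store behaviour described in the summary can be illustrated with a small sketch. This is not the authors' simulator: it is a toy model of one thread's load/store queue, and the names `SpeculativeLSQ`, `execute_load`, `execute_store`, and `reissues` are hypothetical. It assumes that a load forwards from older stores that have already executed, that program order within a thread is given by a `seq` number, and that "recovery" can be reduced to re-reading the value and counting the event; a real pipeline would also have to squash the instructions that depend on the load.

```python
from dataclasses import dataclass


@dataclass
class Load:
    seq: int      # program-order position within the thread
    addr: int     # resolved memory address
    value: int    # value the load currently holds
    src_seq: int  # seq of the store that supplied the value, -1 for memory


class SpeculativeLSQ:
    """Toy per-thread load/store queue with always-speculate loads."""

    def __init__(self, memory):
        self.memory = dict(memory)  # committed memory contents
        self.stores = {}            # seq -> (addr, value) of executed stores
        self.loads = {}             # seq -> Load for executed loads
        self.reissues = 0           # number of loads that read too early

    def _youngest_older_store(self, seq, addr):
        # Among stores that have already executed, find the youngest one
        # that is older than `seq` and writes to `addr`.
        best = -1
        for s_seq, (s_addr, _) in self.stores.items():
            if s_addr == addr and best < s_seq < seq:
                best = s_seq
        return best

    def execute_load(self, seq, addr):
        # Speculate: do not wait for older stores with unresolved addresses.
        src = self._youngest_older_store(seq, addr)
        value = self.stores[src][1] if src >= 0 else self.memory.get(addr, 0)
        self.loads[seq] = Load(seq, addr, value, src)
        return value

    def execute_store(self, seq, addr, value):
        self.stores[seq] = (addr, value)
        # Check every later load that has already executed. An address match
        # with an older data source means the load read too early.
        for ld in self.loads.values():
            if ld.seq > seq and ld.addr == addr and ld.src_seq < seq:
                ld.value, ld.src_seq = value, seq  # stand-in for recover + re-issue
                self.reissues += 1


# A load (seq 2) executes before the older store (seq 1) to the same address.
lsq = SpeculativeLSQ({0x40: 1})
early = lsq.execute_load(seq=2, addr=0x40)    # speculative read from memory
lsq.execute_store(seq=1, addr=0x40, value=7)  # older store resolves late
```

In this run the load first obtains the old memory contents and is then corrected when the older store executes, which is the recover-and-re-issue path the summary describes. When no later load matches a store's address, `execute_store` leaves every load untouched and no penalty is counted.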