Detecting Silent Data Corruption from Hardware Counters

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 1-13
Main Authors: Choi, Minseop; Azzaoui, Taha; Chaisson, Kyle; Arias, Orlando; Son, Seung Woo
Format: Conference Proceeding
Language: English
Published: IEEE, 02.09.2025
ISSN: 2168-9253
DOI: 10.1109/CLUSTER59342.2025.11186479

Summary: Silent Data Corruptions (SDCs), which can manifest at the application level despite extensive screening and testing, can disrupt meaningful scientific interpretation, thereby necessitating robust monitoring tools capable of detecting them. While prior approaches have demonstrated competitive detection performance, they often require nontrivial modifications to algorithms or prior knowledge, such as spatial or temporal data patterns, to be effective. Furthermore, an error model based on standard random bit flips may not reflect realistic scenarios and can include relatively easy-to-detect cases with obvious deviations. In this work, we study SDCs and their effects on sparse matrix computations, kernels prevalent in many scientific applications, using hardware counters, which can serve as a holistic indicator of program behavior changes caused by SDCs. We experiment with a set of sparse matrix benchmarks using a method that simulates data corruption to varying degrees based on our extensive analysis of error propagation, creating realistic SDC occurrences at the application level. We detail the process of sampling hardware performance counters with minimal disturbance. Using the collected hardware counters, we train various classes of classifiers, including standard ML, neural-network-based, and unsupervised models, to accurately detect SDCs. Our experimental evaluations through k-fold cross-validation indicate that hardware counters can effectively detect the presence of SDCs with a low false positive rate, incurring comparable training overhead and minor inference overhead relative to the state of the art. Our approach achieves a competitive average recall (>0.91) under a realistic error rate based on the observed error propagation and low runtime overhead (<2%) while avoiding program modifications.
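
To illustrate the general workflow the abstract describes, the sketch below (not the authors' tooling) collects a few hardware counters for one run of a benchmark command via Linux `perf stat`, then evaluates an SDC classifier with k-fold cross-validation using scikit-learn. The event list, benchmark command, classifier choice, and the synthetic training data are illustrative assumptions; the paper's actual counter selection, models, and labeling pipeline may differ.

    # Minimal sketch, assuming Linux perf and scikit-learn are available.
    import subprocess
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Hypothetical counter set; the paper's exact counters may differ.
    EVENTS = ["instructions", "cache-misses", "branch-misses", "LLC-load-misses"]

    def sample_counters(cmd):
        """Run `cmd` once under perf and return one counter value per event.

        `perf stat -x,` prints one CSV line per event to stderr:
        value,unit,event,...  Non-numeric values (e.g. '<not counted>') become NaN.
        """
        proc = subprocess.run(
            ["perf", "stat", "-x,", "-e", ",".join(EVENTS), "--"] + cmd,
            capture_output=True, text=True, check=True,
        )
        values = {}
        for line in proc.stderr.splitlines():
            parts = line.split(",")
            if len(parts) >= 3 and parts[2] in EVENTS:
                try:
                    values[parts[2]] = float(parts[0])
                except ValueError:
                    values[parts[2]] = float("nan")
        return [values.get(e, float("nan")) for e in EVENTS]

    def evaluate(X, y, k=5):
        """Mean k-fold cross-validated recall for SDC detection (label 1 = corrupted run)."""
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        return cross_val_score(clf, X, y, cv=cv, scoring="recall").mean()

    if __name__ == "__main__":
        # X: counter vectors from clean and fault-injected runs; y: 0 = clean, 1 = SDC.
        # The random data below is a placeholder for features gathered via sample_counters().
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, len(EVENTS)))
        y = rng.integers(0, 2, size=200)
        print("mean 5-fold recall:", round(evaluate(X, y), 3))

In practice, each feature vector would come from one monitored run of a sparse matrix kernel (clean or fault-injected), and the recall and false positive rate reported by cross-validation correspond to the detection metrics discussed in the abstract.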