Detecting Silent Data Corruption from Hardware Counters

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 1-13
Main Authors: Choi, Minseop; Azzaoui, Taha; Chaisson, Kyle; Arias, Orlando; Son, Seung Woo
Format: Conference Proceeding
Language: English
Published: IEEE, 02.09.2025
ISSN: 2168-9253
DOI: 10.1109/CLUSTER59342.2025.11186479

Summary: Silent Data Corruptions (SDCs), which can manifest at the application level despite extensive screening and testing, can disrupt meaningful scientific interpretation, thereby necessitating robust monitoring tools capable of detecting them. While prior approaches have demonstrated competitive detection performance, they often require nontrivial modifications to algorithms or prior knowledge, such as spatial or temporal data patterns, to be effective. Furthermore, an error model based on standard random bit flips may not reflect realistic scenarios and can include relatively easy-to-detect cases with obvious deviations. In this work, we study SDCs and their effects on sparse matrix computations, kernels prevalent in many scientific applications, using hardware counters, which can serve as a holistic indicator of program behavior changes caused by SDCs. We experiment with a set of sparse matrix benchmarks using a method that simulates data corruption to varying degrees based on our extensive analysis of error propagation, creating realistic SDC occurrences at the application level. We detail the process of sampling hardware performance counters with minimal disturbance. Using the collected hardware counters, we train various classes of classifiers, including standard ML, neural-network-based, and unsupervised models, to accurately detect SDCs. Our experimental evaluations through k-fold cross-validation indicate that hardware counters can effectively detect the presence of SDCs with a low false positive rate, incurring comparable training overhead and minor inference overhead relative to the state of the art. Our approach achieves a competitive average recall (>0.91) under a realistic error rate based on the observed error propagation and low runtime overhead (<2%) while avoiding program modifications.
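
To illustrate the general workflow the abstract describes, the sketch below (not the authors' tooling) collects a few hardware counters for one run of a benchmark command via Linux `perf stat`, then evaluates an SDC classifier with k-fold cross-validation using scikit-learn. The event list, benchmark command, classifier choice, and the synthetic training data are illustrative assumptions; the paper's actual counter selection, models, and labeling pipeline may differ.

    # Minimal sketch, assuming Linux perf and scikit-learn are available.
    import subprocess
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_val_score

    # Hypothetical counter set; the paper's exact counters may differ.
    EVENTS = ["instructions", "cache-misses", "branch-misses", "LLC-load-misses"]

    def sample_counters(cmd):
        """Run `cmd` once under perf and return one counter value per event.

        `perf stat -x,` prints one CSV line per event to stderr:
        value,unit,event,...  Non-numeric values (e.g. '<not counted>') become NaN.
        """
        proc = subprocess.run(
            ["perf", "stat", "-x,", "-e", ",".join(EVENTS), "--"] + cmd,
            capture_output=True, text=True, check=True,
        )
        values = {}
        for line in proc.stderr.splitlines():
            parts = line.split(",")
            if len(parts) >= 3 and parts[2] in EVENTS:
                try:
                    values[parts[2]] = float(parts[0])
                except ValueError:
                    values[parts[2]] = float("nan")
        return [values.get(e, float("nan")) for e in EVENTS]

    def evaluate(X, y, k=5):
        """Mean k-fold cross-validated recall for SDC detection (label 1 = corrupted run)."""
        clf = RandomForestClassifier(n_estimators=200, random_state=0)
        cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
        return cross_val_score(clf, X, y, cv=cv, scoring="recall").mean()

    if __name__ == "__main__":
        # X: counter vectors from clean and fault-injected runs; y: 0 = clean, 1 = SDC.
        # The random data below is a placeholder for features gathered via sample_counters().
        rng = np.random.default_rng(0)
        X = rng.normal(size=(200, len(EVENTS)))
        y = rng.integers(0, 2, size=200)
        print("mean 5-fold recall:", round(evaluate(X, y), 3))

In practice, each feature vector would come from one monitored run of a sparse matrix kernel (clean or fault-injected), and the recall and false positive rate reported by cross-validation correspond to the detection metrics discussed in the abstract.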