A systematic literature review on benchmarks for evaluating debugging approaches
| Published in | The Journal of Systems and Software, Vol. 192, p. 111423 |
|---|---|
| Main Authors | , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Inc, 01.10.2022 |
| ISSN | 0164-1212, 1873-1228 |
| DOI | 10.1016/j.jss.2022.111423 |
Summary: Bug benchmarks are used in the development and evaluation of debugging approaches, e.g., fault localization and automated repair. A quantitative performance comparison of different debugging approaches is only possible when they have been evaluated on the same dataset or benchmark. However, benchmarks are often specialized towards certain debugging approaches in the data, metrics, and artifacts they contain. Such benchmarks cannot easily be used for debugging approaches outside their scope, as those approaches may rely on specific data, such as bug reports or code metrics, that are not included in the dataset. Furthermore, benchmarks vary in size w.r.t. the number of subject programs and the size of the individual subject programs. For these reasons, we have performed a systematic literature review in which we identified 73 benchmarks that can be used to evaluate debugging approaches. We compare the different benchmarks w.r.t. their size and the provided information, such as bug reports, contained test cases, and other code metrics. This comparison is intended to help researchers quickly identify all suitable benchmarks for evaluating their specific debugging approaches. Furthermore, we discuss recurring issues and challenges in the selection, acquisition, and usage of such bug benchmarks, i.e., data availability, data quality, duplicated content, data formats, reproducibility, and extensibility.
Editor’s note: Open Science material was validated by the Journal of Systems and Software Open Science Board.
• Identification of 73 benchmarks for evaluating debugging approaches.
• Comparison of benchmarks w.r.t. size, language, and realization of FAIR criteria.
• Discussion of problems in benchmarks w.r.t. data availability, quality, and formats.