FlakiMe: Laboratory-Controlled Test Flakiness Impact Assessment. A Case Study on Mutation Testing and Program Repair
          | Main Authors | , , , | 
|---|---|
| Format | Journal Article | 
| Language | English | 
| Published | 06.12.2019 |
| Subjects | |
| DOI | 10.48550/arxiv.1912.03197 | 
| Summary: | Much research on software testing makes an implicit assumption that test
failures are deterministic such that they always witness the presence of the
same defects. However, this assumption is not always true because some test
failures are due to so-called flaky tests, i.e., tests with non-deterministic
outcomes. Unfortunately, flaky tests have major implications for testing and
test-dependent activities such as mutation testing and automated program
repair. To deal with this issue, we introduce a test flakiness assessment and
experimentation platform, called FlakiMe, that supports the seeding of a
(controllable) degree of flakiness into the behaviour of a given test suite.
Thereby, FlakiMe equips researchers with ways to investigate the impact of test
flakiness on their techniques under laboratory-controlled conditions. We use
FlakiMe to report results and insights from case studies that assess the
impact of flakiness on mutation testing and program repair. These results
indicate that a 5% rate of flaky failures is enough to affect the mutation
score, although the effect size is modest (2% to 4%), whereas it completely
eliminates the ability of program repair to patch 50% of the subject programs.
We also observe that flakiness has case-specific effects, mainly disrupting
the repair of bugs that are covered by many tests. Moreover, we find that a
minimal amount of user feedback is sufficient for alleviating the effects of
flakiness. | 
|---|---|
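
The summary describes seeding a controllable degree of flakiness into a test suite. The Python sketch below illustrates that general idea only; it is not FlakiMe's implementation or API. The names `seed_flakiness` and `suite_pass_probability` are hypothetical, and the sketch assumes that each test fails spuriously and independently with a fixed probability `rate`.

```python
import random
from typing import Callable


def seed_flakiness(
    test: Callable[[], bool], rate: float, rng: random.Random
) -> Callable[[], bool]:
    """Wrap a test so that it fails spuriously with probability `rate`.

    Hypothetical helper: `test` returns True on pass and False on failure.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")

    def flaky_test() -> bool:
        if rng.random() < rate:
            return False  # injected flaky failure, unrelated to the code under test
        return test()  # otherwise report the genuine outcome

    return flaky_test


def suite_pass_probability(rate: float, n_tests: int) -> float:
    """Chance that `n_tests` genuinely passing tests all pass in one run,

    assuming independent injected failures (an assumption of this sketch).
    """
    return (1.0 - rate) ** n_tests
```

Under these assumptions, a 5% rate applied to 20 covering tests leaves roughly a 36% chance that a fully correct patch passes all of them in a single run (`suite_pass_probability(0.05, 20)`). That is one plausible intuition for why bugs covered by many tests could be harder to repair under flakiness; it is not an analysis taken from the paper.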