Implicit data crimes: Machine learning bias arising from misuse of public data
| Published in | Proceedings of the National Academy of Sciences (PNAS), Vol. 119, No. 13, pp. 1-11 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | United States: National Academy of Sciences, 29.03.2022 |
| ISSN | 0027-8424 (print); 1091-6490 (online) |
| DOI | 10.1073/pnas.2117203119 |
| Summary: | Although open databases are an important resource in the current deep learning (DL) era, they are sometimes used “off label”: Data published for one task are used to train algorithms for a different one. This work aims to highlight that this common practice may lead to biased, overly optimistic results. We demonstrate this phenomenon for inverse problem solvers and show how their biased performance stems from hidden data-processing pipelines. We describe two processing pipelines typical of open-access databases and study their effects on three well-established algorithms developed for MRI reconstruction: compressed sensing, dictionary learning, and DL. Our results demonstrate that all these algorithms yield systematically biased results when they are naively trained on seemingly appropriate data: The normalized rms error improves consistently with the extent of data processing, showing an artificial improvement of 25 to 48% in some cases. Because this phenomenon is not widely known, biased results sometimes are published as state of the art; we refer to that as implicit “data crimes.” This work hence aims to raise awareness regarding naive off-label usage of big data and reveal the vulnerability of modern inverse problem solvers to the resulting bias. |
|---|---|
| Bibliography: | Edited by David Donoho, Stanford University, Stanford, CA; received September 27, 2021; accepted February 1, 2022. Author contributions: E.S. and M.L. designed research; E.S. performed research; E.S., J.I.T., and K.W. wrote code; E.S. conducted experiments and analyzed data; and E.S., J.I.T., and M.L. wrote the paper. |
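The abstract reports that the normalized rms error (NRMSE) of the studied reconstruction algorithms improves artificially with the extent of hidden data processing. As a minimal sketch of that metric only (the toy arrays below are hypothetical and do not come from the paper), NRMSE can be computed as the l2 norm of the reconstruction error divided by the l2 norm of the ground truth:

```python
import math

def nrmse(reconstruction, ground_truth):
    """Normalized root-mean-square error:
    ||x_rec - x_gt||_2 / ||x_gt||_2 (lower is better)."""
    err = math.sqrt(sum((r - g) ** 2 for r, g in zip(reconstruction, ground_truth)))
    ref = math.sqrt(sum(g ** 2 for g in ground_truth))
    return err / ref

# Hypothetical toy signals for illustration only.
gt = [0.0, 1.0, 0.5, 0.8]
rec = [0.1, 0.9, 0.6, 0.7]
print(round(nrmse(rec, gt), 3))  # → 0.145
```

The paper's point is that when the "ground truth" itself has been silently processed (e.g., smoothed or zero-padded), this score can look systematically better without any real gain in reconstruction quality.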