Snorkel: rapid training data creation with weak supervision

Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitr...

Full description

Saved in:
Bibliographic Details
Published inThe VLDB journal Vol. 29; no. 2-3; pp. 709 - 730
Main Authors Ratner, Alexander, Bach, Stephen H., Ehrenberg, Henry, Fries, Jason, Wu, Sen, Ré, Christopher
Format Journal Article
LanguageEnglish
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.05.2020
Springer Nature B.V
Springer Science + Business Media
Subjects
Online AccessGet full text
ISSN1066-8888
0949-877X
0949-877X
DOI10.1007/s00778-019-00552-1

Cover

More Information
Summary:Labeling training data is increasingly the largest bottleneck in deploying machine learning systems. We present Snorkel, a first-of-its-kind system that enables users to train state-of-the-art models without hand labeling any training data. Instead, users write labeling functions that express arbitrary heuristics, which can have unknown accuracies and correlations. Snorkel denoises their outputs without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experience over the past year collaborating with companies, agencies, and research laboratories. In a user study, subject matter experts build models 2.8 × faster and increase predictive performance an average 45.5 % versus seven hours of hand labeling. We study the modeling trade-offs in this new setting and propose an optimizer for automating trade-off decisions that gives up to 1.8 × speedup per pipeline execution. In two collaborations, with the US Department of Veterans Affairs and the US Food and Drug Administration, and on four open-source text and image data sets representative of other deployments, Snorkel provides 132 % average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60 % of the predictive performance of large hand-curated training sets.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
USDOE
108845
ISSN:1066-8888
0949-877X
0949-877X
DOI:10.1007/s00778-019-00552-1