Findings of the BabyLM Challenge: Sample-Efficient Pretraining on Developmentally Plausible Corpora
2023. In Proceedings of the BabyLM Challenge at the 27th Conference on Computational Natural Language Learning, pages 1-34, Singapore. Association for Computational Linguistics Children can acquire language from less than 100 million words of input. Large language models are far less data-efficient:...
Saved in:
| Main Authors | , , , , , , , , , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published |
10.04.2025
|
| Subjects | |
| Online Access | Get full text |
| DOI | 10.48550/arxiv.2504.08165 |
Cover
| Summary: | 2023. In Proceedings of the BabyLM Challenge at the 27th
Conference on Computational Natural Language Learning, pages 1-34, Singapore.
Association for Computational Linguistics Children can acquire language from less than 100 million words of input.
Large language models are far less data-efficient: they typically require 3 or
4 orders of magnitude more data and still do not perform as well as humans on
many evaluations. These intensive resource demands limit the ability of
researchers to train new models and use existing models as developmentally
plausible cognitive models. The BabyLM Challenge is a communal effort in which
participants compete to optimize language model training on a fixed data
budget. Submissions are compared on various evaluation tasks targeting
grammatical ability, downstream task performance, and generalization.
Participants can submit to up to three tracks with progressively looser data
restrictions. From over 30 submissions, we extract concrete recommendations on
how best to train data-efficient language models, and on where future efforts
should (and perhaps should not) focus. The winning submissions using the
LTG-BERT architecture (Samuel et al., 2023) outperformed models trained on
trillions of words. Other submissions achieved strong results through training
on shorter input sequences or training a student model on a pretrained teacher.
Curriculum learning attempts, which accounted for a large number of
submissions, were largely unsuccessful, though some showed modest improvements. |
|---|---|
| DOI: | 10.48550/arxiv.2504.08165 |