Bigger is not always better: The importance of human-scale language modeling for psycholinguistics
| Published in | Journal of Memory and Language, Vol. 144, p. 104650 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Inc., 01.10.2025 |
| Subjects | |
| ISSN | 0749-596X, 1096-0821 |
| DOI | 10.1016/j.jml.2025.104650 |
| Summary: | When trained to place high probability on a training corpus, neural network language models can learn a surprising amount about language. Recent work has demonstrated that large performance improvements can arise from simply increasing, i.e., scaling, the size of the corpora they are trained on and the number of parameters in those models. Accordingly, many contemporary systems are trained on trillions of words. While largely beneficial to performance on language applications, scaling has several downsides for both computational psycholinguistics and natural language processing research. We discuss the scientific challenges presented by the scaling paradigm, as well as the benefits that would result from language models that can learn from human-scale data. In the second half of this paper, we report on findings from a recent effort to bring about human-scale language model pretraining: the first iteration of the BabyLM Challenge, a shared task organized by the authors that invited participants to train a language model on 100 million words or less. The challenge produced several concrete best practices for practitioners interested in small-scale language modeling. For cognitive scientists, the challenge demonstrated that robust linguistic generalizations can be learned by models trained on a human-scale dataset, though this is not yet achieved through cognitively plausible mechanisms. Furthermore, it established a population of “BabyLMs” that are all effective at data-efficient language learning. Studying such models can help us identify hypotheses for the computational mechanisms that underlie human language acquisition. |
|---|---|
| Highlights: | • Psycholinguistics benefits from computational models trained at human data scale. • We report on the BabyLM Challenge, an effort to train models at human scale. • BabyLM models achieve close to human-level performance on some tasks. • High language modeling performance is attainable with academic computational resources. • We identify actionable insights for human-scale language modeling. |