Visual Grounding Helps Learn Word Meanings in Low-Data Regimes
| Main Authors | , , |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | 19.10.2023 |
| Subjects | |
| Online Access | Get full text |
| DOI | 10.48550/arxiv.2310.13257 |
| Summary: | Modern neural language models (LMs) are powerful tools for modeling human
sentence production and comprehension, and their internal representations are
remarkably well-aligned with representations of language in the human brain.
But to achieve these results, LMs must be trained in distinctly un-human-like
ways -- requiring orders of magnitude more language data than children receive
during development, and without perceptual or social context. Do models trained
more naturalistically -- with grounded supervision -- exhibit more humanlike
language learning? We investigate this question in the context of word
learning, a key sub-task in language acquisition. We train a diverse set of LM
architectures, with and without auxiliary visual supervision, on datasets of
varying scales. We then evaluate these models' learning of syntactic
categories, lexical relations, semantic features, word similarity, and
alignment with human neural representations. We find that visual supervision
can indeed improve the efficiency of word learning. However, these improvements
are limited: they are present almost exclusively in the low-data regime, and
sometimes canceled out by the inclusion of rich distributional signals from
text. The information conveyed by text and images is not redundant -- models
mainly driven by visual information yield representations qualitatively
different from those mainly driven by word co-occurrences. However, our
results suggest that current
multimodal modeling approaches fail to effectively leverage visual information
to build human-like word representations from human-scale data. |
|---|---|
| DOI: | 10.48550/arxiv.2310.13257 |
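Among the evaluations named in the abstract is word similarity. As a rough illustration of how such an evaluation is commonly scored (a minimal sketch, not the paper's code; the function names, toy data, and benchmark format are assumptions), a model's embedding-space cosine similarities can be correlated with human similarity ratings:

```python
# Hypothetical sketch: score a word-embedding table against human
# word-similarity judgments via Spearman rank correlation.
# Not taken from the paper's implementation; names and data are illustrative.
import numpy as np
from scipy.stats import spearmanr


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))


def word_similarity_score(embeddings: dict[str, np.ndarray],
                          human_pairs: list[tuple[str, str, float]]) -> float:
    """Spearman correlation between model cosine similarities and human
    ratings, computed over the word pairs covered by the model's vocabulary."""
    model_sims, human_sims = [], []
    for w1, w2, rating in human_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_sims.append(cosine(embeddings[w1], embeddings[w2]))
            human_sims.append(rating)
    rho, _ = spearmanr(model_sims, human_sims)
    return rho


if __name__ == "__main__":
    # Toy stand-ins for a trained model's word embeddings and a human-rated
    # similarity benchmark (e.g. SimLex-style word pairs with ratings).
    rng = np.random.default_rng(0)
    vocab = ["dog", "cat", "car", "truck"]
    embeddings = {w: rng.normal(size=64) for w in vocab}
    human_pairs = [("dog", "cat", 7.3), ("car", "truck", 8.1), ("dog", "car", 1.2)]
    print("Spearman rho:", word_similarity_score(embeddings, human_pairs))
```

The same scoring loop could be run for each model variant (text-only vs. visually supervised) and each training-data scale to compare learning efficiency, in the spirit of the comparison the abstract describes.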