Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Bibliographic Details
Main Authors: Zhuang, Chengxu; Fedorenko, Evelina; Andreas, Jacob
Format: Journal Article (arXiv preprint)
Language: English
Published: 19.10.2023
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online Access: https://arxiv.org/abs/2310.13257
DOI: 10.48550/arxiv.2310.13257
Copyright: http://creativecommons.org/licenses/by/4.0


Abstract: Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways -- requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models' learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and are sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- models mainly driven by visual information yield representations that are qualitatively different from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.
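The record does not specify the training objective, but one common way to add the kind of auxiliary visual supervision described in the abstract is to pair a standard next-token prediction loss with a CLIP-style contrastive loss that aligns caption representations with image features. The sketch below is an illustrative assumption along those lines, not the authors' implementation; all module names, dimensions, and toy inputs are hypothetical.

```python
# Minimal sketch (assumed, not the paper's method): combine a language-modeling
# loss with a contrastive image-caption alignment loss as "visual supervision".
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, image_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token prediction
        self.text_proj = nn.Linear(d_model, d_model)      # caption -> shared space
        self.image_proj = nn.Linear(image_dim, d_model)   # image features -> shared space

    def forward(self, tokens, image_feats):
        h, _ = self.encoder(self.embed(tokens))                        # (B, T, d_model)
        lm_logits = self.lm_head(h)                                    # language-modeling logits
        text_vec = F.normalize(self.text_proj(h.mean(dim=1)), dim=-1)  # pooled caption vector
        image_vec = F.normalize(self.image_proj(image_feats), dim=-1)
        return lm_logits, text_vec, image_vec

def training_loss(model, tokens, image_feats, temperature=0.07):
    # Standard LM loss: predict token t+1 from tokens up to t.
    lm_logits, text_vec, image_vec = model(tokens[:, :-1], image_feats)
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    # Symmetric contrastive loss pairing each caption with its own image.
    sims = text_vec @ image_vec.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sims.size(0))
    grounding_loss = 0.5 * (F.cross_entropy(sims, targets) +
                            F.cross_entropy(sims.t(), targets))
    return lm_loss + grounding_loss

# Toy usage with random token ids and precomputed image features.
model = GroundedLM()
tokens = torch.randint(0, 1000, (4, 12))
image_feats = torch.randn(4, 512)
print(training_loss(model, tokens, image_feats))
```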
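Likewise, the word-similarity evaluation listed in the abstract is typically scored by correlating model-derived cosine similarities with human similarity ratings. The snippet below is a generic sketch of that kind of scoring; the word pairs, ratings, and vectors are toy placeholders, not data from the paper.

```python
# Generic word-similarity evaluation sketch: Spearman correlation between
# cosine similarities of learned word vectors and human similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(embeddings, human_pairs):
    """embeddings: dict word -> vector; human_pairs: list of (w1, w2, rating)."""
    model_sims, human_sims = [], []
    for w1, w2, rating in human_pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue  # skip out-of-vocabulary pairs
        v1, v2 = embeddings[w1], embeddings[w2]
        cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        model_sims.append(cos)
        human_sims.append(rating)
    return spearmanr(model_sims, human_sims).correlation

# Toy usage with random vectors and made-up ratings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=64) for w in ["dog", "cat", "car", "truck"]}
pairs = [("dog", "cat", 8.5), ("car", "truck", 8.0), ("dog", "car", 2.0)]
print(word_similarity_score(emb, pairs))
```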