Visual Grounding Helps Learn Word Meanings in Low-Data Regimes

Bibliographic Details
Main Authors: Zhuang, Chengxu; Fedorenko, Evelina; Andreas, Jacob
Format: Journal Article (arXiv preprint)
Language: English
Published: 19.10.2023
Subjects: Computer Science - Artificial Intelligence; Computer Science - Computation and Language
Online Access: https://arxiv.org/abs/2310.13257
DOI: 10.48550/arxiv.2310.13257
Copyright: http://creativecommons.org/licenses/by/4.0


Abstract: Modern neural language models (LMs) are powerful tools for modeling human sentence production and comprehension, and their internal representations are remarkably well-aligned with representations of language in the human brain. But to achieve these results, LMs must be trained in distinctly un-human-like ways -- requiring orders of magnitude more language data than children receive during development, and without perceptual or social context. Do models trained more naturalistically -- with grounded supervision -- exhibit more humanlike language learning? We investigate this question in the context of word learning, a key sub-task in language acquisition. We train a diverse set of LM architectures, with and without auxiliary visual supervision, on datasets of varying scales. We then evaluate these models' learning of syntactic categories, lexical relations, semantic features, word similarity, and alignment with human neural representations. We find that visual supervision can indeed improve the efficiency of word learning. However, these improvements are limited: they are present almost exclusively in the low-data regime, and are sometimes canceled out by the inclusion of rich distributional signals from text. The information conveyed by text and images is not redundant -- models mainly driven by visual information yield representations that are qualitatively different from those mainly driven by word co-occurrences. However, our results suggest that current multimodal modeling approaches fail to effectively leverage visual information to build human-like word representations from human-scale data.
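The record does not specify the training objective, but one common way to add the kind of auxiliary visual supervision described in the abstract is to pair a standard next-token prediction loss with a CLIP-style contrastive loss that aligns caption representations with image features. The sketch below is an illustrative assumption along those lines, not the authors' implementation; all module names, dimensions, and toy inputs are hypothetical.

```python
# Minimal sketch (assumed, not the paper's method): combine a language-modeling
# loss with a contrastive image-caption alignment loss as "visual supervision".
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundedLM(nn.Module):
    def __init__(self, vocab_size=1000, d_model=128, image_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)     # next-token prediction
        self.text_proj = nn.Linear(d_model, d_model)      # caption -> shared space
        self.image_proj = nn.Linear(image_dim, d_model)   # image features -> shared space

    def forward(self, tokens, image_feats):
        h, _ = self.encoder(self.embed(tokens))                        # (B, T, d_model)
        lm_logits = self.lm_head(h)                                    # language-modeling logits
        text_vec = F.normalize(self.text_proj(h.mean(dim=1)), dim=-1)  # pooled caption vector
        image_vec = F.normalize(self.image_proj(image_feats), dim=-1)
        return lm_logits, text_vec, image_vec

def training_loss(model, tokens, image_feats, temperature=0.07):
    # Standard LM loss: predict token t+1 from tokens up to t.
    lm_logits, text_vec, image_vec = model(tokens[:, :-1], image_feats)
    lm_loss = F.cross_entropy(
        lm_logits.reshape(-1, lm_logits.size(-1)), tokens[:, 1:].reshape(-1)
    )
    # Symmetric contrastive loss pairing each caption with its own image.
    sims = text_vec @ image_vec.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(sims.size(0))
    grounding_loss = 0.5 * (F.cross_entropy(sims, targets) +
                            F.cross_entropy(sims.t(), targets))
    return lm_loss + grounding_loss

# Toy usage with random token ids and precomputed image features.
model = GroundedLM()
tokens = torch.randint(0, 1000, (4, 12))
image_feats = torch.randn(4, 512)
print(training_loss(model, tokens, image_feats))
```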
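Likewise, the word-similarity evaluation listed in the abstract is typically scored by correlating model-derived cosine similarities with human similarity ratings. The snippet below is a generic sketch of that kind of scoring; the word pairs, ratings, and vectors are toy placeholders, not data from the paper.

```python
# Generic word-similarity evaluation sketch: Spearman correlation between
# cosine similarities of learned word vectors and human similarity ratings.
import numpy as np
from scipy.stats import spearmanr

def word_similarity_score(embeddings, human_pairs):
    """embeddings: dict word -> vector; human_pairs: list of (w1, w2, rating)."""
    model_sims, human_sims = [], []
    for w1, w2, rating in human_pairs:
        if w1 not in embeddings or w2 not in embeddings:
            continue  # skip out-of-vocabulary pairs
        v1, v2 = embeddings[w1], embeddings[w2]
        cos = float(np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2)))
        model_sims.append(cos)
        human_sims.append(rating)
    return spearmanr(model_sims, human_sims).correlation

# Toy usage with random vectors and made-up ratings.
rng = np.random.default_rng(0)
emb = {w: rng.normal(size=64) for w in ["dog", "cat", "car", "truck"]}
pairs = [("dog", "cat", 8.5), ("car", "truck", 8.0), ("dog", "car", 2.0)]
print(word_similarity_score(emb, pairs))
```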