Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

Bibliographic Details
Published in Proceedings / IEEE Workshop on Applications of Computer Vision, pp. 5551-5561
Main Authors Sahin, Ugur; Li, Hang; Khan, Qadeer; Cremers, Daniel; Tresp, Volker
Format Conference Proceeding
Language English
Published IEEE 03.01.2024
ISSN 2642-9381
DOI 10.1109/WACV57701.2024.00547

Abstract Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) Existing generative approaches, however, primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both of these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
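
The abstract describes contrastive training with hard negatives in both modalities. As a rough illustration only, the sketch below computes an InfoNCE-style loss in which an image is contrasted against negative texts and a text is contrasted against negative images. It is not the authors' released code: the function names, the toy embedding vectors, the equal weighting of the two directions, and the temperature of 0.07 are all assumptions made for this example, and embeddings are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Cross-entropy of selecting the positive among positive + negatives."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]


def bidirectional_loss(img, txt, neg_imgs, neg_txts, temperature=0.07):
    """Average of image-to-text and text-to-image losses (hypothetical helper).

    neg_txts stand in for generated negative captions for the image;
    neg_imgs stand in for generated negative images for the caption.
    """
    i2t = info_nce(img, txt, neg_txts, temperature)
    t2i = info_nce(txt, img, neg_imgs, temperature)
    return 0.5 * (i2t + t2i)
```

In this toy setting, a negative whose embedding lies close to the positive yields a larger loss than an unrelated one, which is the sense in which a negative is "hard" for the model.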
Author Sahin, Ugur (Technical University of Munich)
Li, Hang (LMU Munich)
Khan, Qadeer (Technical University of Munich)
Cremers, Daniel (Technical University of Munich)
Tresp, Volker (LMU Munich)
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Applied Sciences
EISBN 9798350318920
EISSN 2642-9381
EndPage 5561
ExternalDocumentID 10484294
Genre orig-research
IsPeerReviewed false
IsScholarly true
PageCount 11
PublicationDate 2024-01-03
PublicationTitle Proceedings / IEEE Workshop on Applications of Computer Vision
PublicationTitleAbbrev WACV
PublicationYear 2024
Publisher IEEE
StartPage 5551
SubjectTerms Algorithms
Codes
Cognition
Computer vision
Image recognition and understanding
Machine learning architectures, formulations, and algorithms
Pipelines
Self-supervised learning
Training
Vision + language and/or other modalities
Visualization
URI https://ieeexplore.ieee.org/document/10484294