Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining

Bibliographic Details
Published in	Proceedings / IEEE Workshop on Applications of Computer Vision, pp. 5551-5561
Main Authors	Sahin, Ugur; Li, Hang; Khan, Qadeer; Cremers, Daniel; Tresp, Volker
Format	Conference Proceeding
Language	English
Published	IEEE 03.01.2024
Subjects	Algorithms; Codes; Cognition; Computer vision; Image recognition and understanding; Machine learning architectures, formulations, and algorithms; Pipelines; Self-supervised learning; Training; Vision + language and/or other modalities; Visualization
ISSN	2642-9381
DOI	10.1109/WACV57701.2024.00547

Abstract	Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) However, existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html.
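The abstract describes contrastive training with hard negatives in both modalities but does not state the loss. As a rough illustration only, the sketch below assumes an InfoNCE-style objective over precomputed embedding vectors and shows why a negative that lies close to the positive contributes more to the loss than an easy one. It is not the authors' implementation; the function names (`info_nce`, `bidirectional_loss`), the temperature value, and the toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so that dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def info_nce(anchor, positive, negatives, temperature=0.07):
    """Cross-entropy of selecting `positive` among positive + negatives.

    Assumption: an InfoNCE-style loss; the abstract only says "contrastive".
    """
    anchor = normalize(anchor)
    candidates = [normalize(positive)] + [normalize(n) for n in negatives]
    logits = [dot(anchor, c) / temperature for c in candidates]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
    return log_denominator - logits[0]


def bidirectional_loss(image, text, negative_texts, negative_images,
                       temperature=0.07):
    """Average of image-to-text and text-to-image losses.

    `negative_texts` and `negative_images` stand in for generated hard
    negatives in each modality (hypothetical inputs for this sketch).
    """
    image_to_text = info_nce(image, text, negative_texts, temperature)
    text_to_image = info_nce(text, image, negative_images, temperature)
    return 0.5 * (image_to_text + text_to_image)


# Toy 2-D embeddings: the hard negative is close to the positive caption.
image = [1.0, 0.0]
text = [0.9, 0.1]
hard_negative_text = [0.8, 0.3]
easy_negative_text = [-1.0, 0.0]

loss_with_hard = info_nce(image, text, [hard_negative_text])
loss_with_easy = info_nce(image, text, [easy_negative_text])
```

In this toy setting the easy negative adds almost nothing to the loss, while the hard negative produces a clearly larger value, which is the intuition behind generating challenging negatives rather than relying on whatever the dataset happens to contain.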
Author_xml	1. Sahin, Ugur (Technical University of Munich); 2. Li, Hang (LMU Munich); 3. Khan, Qadeer (Technical University of Munich); 4. Cremers, Daniel (Technical University of Munich); 5. Tresp, Volker (LMU Munich)
CODEN	IEEPAD
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/WACV57701.2024.00547
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings Accès UT - IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Xplore IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
Discipline	Applied Sciences
EISBN	9798350318920
EISSN	2642-9381
EndPage	5561
ExternalDocumentID	10484294
Genre	orig-research
IEDL.DBID	RIE
IngestDate	Wed Aug 27 02:11:48 EDT 2025
IsPeerReviewed	false
IsScholarly	true
Language	English
LinkModel	DirectLink
PageCount	11
ParticipantIDs	ieee_primary_10484294
PublicationCentury	2000
PublicationDate	2024-Jan.-3
PublicationDateYYYYMMDD	2024-01-03
PublicationDate_xml	– month: 01 year: 2024 text: 2024-Jan.-3 day: 03
PublicationDecade	2020
PublicationTitle	Proceedings / IEEE Workshop on Applications of Computer Vision
PublicationTitleAbbrev	WACV
PublicationYear	2024
Publisher	IEEE
Publisher_xml	– name: IEEE
SSID	ssj0039193
Score	2.3607335
SourceID	ieee
SourceType	Publisher
StartPage	5551
Title	Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
URI	https://ieeexplore.ieee.org/document/10484294
hasFullText	1
inHoldings	1
isFullTextHit
isPrint