Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining
Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet.
| Published in | Proceedings / IEEE Workshop on Applications of Computer Vision (WACV), pp. 5551 - 5561 |
|---|---|
| Main Authors | Sahin, Ugur; Li, Hang; Khan, Qadeer; Cremers, Daniel; Tresp, Volker |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 03.01.2024 |
| Subjects | Machine learning architectures, formulations, and algorithms; Codes; Cognition; Computer vision; Image recognition and understanding; Pipelines; Self-supervised learning; Training; Vision + language and/or other modalities; Visualization |
| Online Access | https://ieeexplore.ieee.org/document/10484294 |
| ISSN | 2642-9381 |
| DOI | 10.1109/WACV57701.2024.00547 |
| Abstract | Contemporary large-scale visual language models (VLMs) exhibit strong representation capacities, making them ubiquitous for enhancing image and text understanding tasks. They are often trained in a contrastive manner on a large and diverse corpus of images and corresponding text captions scraped from the internet. Despite this, VLMs often struggle with compositional reasoning tasks, which require a fine-grained understanding of the complex interactions of objects and their attributes. This failure can be attributed to two main factors: 1) Contrastive approaches have traditionally focused on mining negative examples from existing datasets. However, the mined negative examples might not be difficult for the model to discriminate from the positive. An alternative to mining would be negative sample generation. 2) Existing generative approaches primarily focus on generating hard negative texts associated with a given image. Mining in the other direction, i.e., generating negative image samples associated with a given text, has been ignored. To overcome both these limitations, we propose a framework that not only mines in both directions but also generates challenging negative samples in both modalities, i.e., images and texts. Leveraging these generative hard negative samples, we significantly enhance VLMs' performance in tasks involving multimodal compositional reasoning. Our code and dataset are released at https://ugorsahin.github.io/enhancing-multimodal-compositional-reasoning-of-vlm.html. |
|---|---|
| Author | Sahin, Ugur (Technical University of Munich); Li, Hang (LMU Munich); Khan, Qadeer (Technical University of Munich); Cremers, Daniel (Technical University of Munich); Tresp, Volker (LMU Munich) |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DOI | 10.1109/WACV57701.2024.00547 |
| Discipline | Applied Sciences |
| EISBN | 9798350318920 |
| EISSN | 2642-9381 |
| EndPage | 5561 |
| ExternalDocumentID | 10484294 |
| Genre | orig-research |
| IsPeerReviewed | false |
| IsScholarly | true |
| Language | English |
| PageCount | 11 |
| PublicationDate | 2024-Jan.-3 |
| PublicationTitle | Proceedings / IEEE Workshop on Applications of Computer Vision |
| PublicationTitleAbbrev | WACV |
| PublicationYear | 2024 |
| Publisher | IEEE |
| StartPage | 5551 |
| SubjectTerms | Machine learning architectures, formulations, and algorithms; Codes; Cognition; Computer vision; Image recognition and understanding; Pipelines; Self-supervised learning; Training; Vision + language and/or other modalities; Visualization |
| Title | Enhancing Multimodal Compositional Reasoning of Visual Language Models with Generative Negative Mining |
| URI | https://ieeexplore.ieee.org/document/10484294 |
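The abstract describes contrastive VLM training in which *generated* hard negatives are added in both modalities: a negative caption for each image and a negative image for each caption. As a rough illustration only, the sketch below shows one plausible way such samples could enter a CLIP-style contrastive loss, where each image scores the batch captions plus its own generated hard-negative caption, and symmetrically for text. All names, tensor shapes, the temperature value, and the exact way the extra negatives are appended are assumptions made for this sketch; it is not taken from the authors' released code (see the project page linked in the abstract).

```python
# Minimal, self-contained PyTorch sketch (not the authors' implementation):
# a symmetric contrastive loss where each sample also competes against its
# own generated hard negative in the opposite modality.
import torch
import torch.nn.functional as F


def contrastive_loss_with_hard_negatives(
    img_emb,        # (B, D) embeddings of the positive images
    txt_emb,        # (B, D) embeddings of the positive captions
    hard_img_emb,   # (B, D) generated hard-negative image per caption (assumed input)
    hard_txt_emb,   # (B, D) generated hard-negative caption per image (assumed input)
    temperature=0.07,
):
    # Normalize so dot products are cosine similarities, as in CLIP-style training.
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    hard_img = F.normalize(hard_img_emb, dim=-1)
    hard_txt = F.normalize(hard_txt_emb, dim=-1)

    # Image -> text: candidates are all in-batch captions plus the image's own
    # generated hard-negative caption, appended as one extra logit column.
    logits_i2t = img @ txt.t() / temperature                              # (B, B)
    extra_t = (img * hard_txt).sum(dim=-1, keepdim=True) / temperature    # (B, 1)
    logits_i2t = torch.cat([logits_i2t, extra_t], dim=1)                  # (B, B+1)

    # Text -> image: candidates are all in-batch images plus the caption's own
    # generated hard-negative image.
    logits_t2i = txt @ img.t() / temperature
    extra_i = (txt * hard_img).sum(dim=-1, keepdim=True) / temperature
    logits_t2i = torch.cat([logits_t2i, extra_i], dim=1)

    # Positives sit on the diagonal (columns 0..B-1); the appended column is
    # always a negative.
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits_i2t, targets) +
                  F.cross_entropy(logits_t2i, targets))


if __name__ == "__main__":
    B, D = 8, 512  # illustrative batch size and embedding width
    loss = contrastive_loss_with_hard_negatives(
        torch.randn(B, D), torch.randn(B, D),
        torch.randn(B, D), torch.randn(B, D))
    print(float(loss))
```

The only design choice illustrated here is appending each sample's generated negative as one extra logit column per direction; the paper's actual objective may weight, batch, or filter these negatives differently.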