Joint Inference of Objects and Scenes With Efficient Learning of Text-Object-Scene Relations

Bibliographic Details
Published in	IEEE Transactions on Multimedia, Vol. 18, no. 3, pp. 507-520
Main Authors	Botao Wang, Dahua Lin, Hongkai Xiong, Y. F. Zheng
Format	Journal Article
Language	English
Published	Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 2016-03-01
Subjects	Algorithms; Bicycles; Conditional random field; Image segmentation; Internet; Object classification; Object detection; Object localization; Prediction algorithms; Scene classification; Semantics; Visualization
ISSN	1520-9210 (print), 1941-0077 (electronic)
DOI	10.1109/TMM.2016.2520087

Abstract	The rapid growth of web images presents new challenges as well as opportunities for the task of image understanding. Conventional approaches rely heavily on fine-grained annotations, such as bounding boxes and semantic segmentations, which are not available for web-scale images. In general, images on the Internet are accompanied by descriptive texts that are relevant to their contents. To bridge the gap between textual and visual analysis for image understanding, this paper presents an algorithm to learn the relations between scenes, objects, and texts with the help of image-level annotations. In particular, the relation between the texts and objects is modeled as the matching probability between the nouns and the object classes, which is obtained by solving a constrained bipartite matching problem. On the other hand, the relations between the scenes and objects/texts are modeled as the conditional distributions of their co-occurrence. Built upon the learned cross-domain relations, an integrated model brings together scenes, objects, and texts for joint image understanding, including scene classification, object classification and localization, and the prediction of object cardinalities. The proposed cross-domain learning algorithm and the integrated model improve the performance of image understanding for web images in the context of textual descriptions. Experimental results show that the proposed algorithm significantly outperforms conventional methods in various computer vision tasks.
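
The two kinds of relations named in the abstract can be illustrated with a small sketch. This is not the paper's formulation: the record gives no equations, so the additive smoothing constant `alpha`, the affinity table `score`, the function names, and the exhaustive search are all assumptions made for illustration. The first function estimates P(object | scene) from image-level co-occurrence counts; the second finds a one-to-one noun-to-class assignment with maximum total affinity, a plain bipartite matching without the paper's additional constraints.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_conditionals(annotations, alpha=1.0):
    """Estimate P(object | scene) from (scene, [objects]) image-level labels.

    alpha is an additive smoothing constant (an assumption of this sketch).
    """
    counts = defaultdict(Counter)
    vocab = set()
    for scene, objects in annotations:
        counts[scene]  # register the scene even if it has no objects
        for obj in set(objects):  # count each object once per image
            counts[scene][obj] += 1
            vocab.add(obj)
    vocab = sorted(vocab)
    probs = {}
    for scene, c in counts.items():
        total = sum(c.values()) + alpha * len(vocab)
        probs[scene] = {o: (c[o] + alpha) / total for o in vocab}
    return probs


def best_matching(nouns, classes, score):
    """Exhaustively search one-to-one noun -> class assignments.

    score maps (noun, cls) to a hypothetical affinity; missing pairs count 0.
    Requires len(nouns) <= len(classes). Only practical for tiny inputs.
    """
    best, best_total = None, float("-inf")
    for perm in permutations(classes, len(nouns)):
        total = sum(score.get((n, c), 0.0) for n, c in zip(nouns, perm))
        if total > best_total:
            best, best_total = dict(zip(nouns, perm)), total
    return best, best_total


# Toy data, invented for this example only.
annotations = [
    ("street", ["car", "bicycle"]),
    ("street", ["car"]),
    ("kitchen", ["cup"]),
]
probs = cooccurrence_conditionals(annotations)

score = {
    ("bike", "bicycle"): 0.9,
    ("bike", "car"): 0.2,
    ("auto", "car"): 0.8,
    ("auto", "bicycle"): 0.1,
}
matching, total = best_matching(["bike", "auto"], ["car", "bicycle", "person"], score)
```

The exhaustive search grows factorially with the number of classes, so a realistic vocabulary would call for a polynomial-time assignment solver instead.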
Author affiliations	Botao Wang (botaowang@sjtu.edu.cn), Dept. of Electron. Eng., Shanghai Jiao Tong Univ., Shanghai, China; Dahua Lin (dhlin@ie.cuhk.edu.hk), Dept. of Inf. Eng., Chinese Univ. of Hong Kong, Hong Kong, China; Hongkai Xiong (xionghongkai@sjtu.edu.cn), Dept. of Electron. Eng., Shanghai Jiao Tong Univ., Shanghai, China; Y. F. Zheng (zheng@ece.osu.edu), Dept. of Electr. & Comput. Eng., Ohio State Univ., Columbus, OH, USA
CODEN	ITMUF8
Copyright	Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2016
Discipline	Engineering Computer Science
Genre	orig-research
Funding	NSFC (grant IDs 61425011; U1201255; 61271218; 61529101; 61472234; 61271211; funder ID 10.13039/100000001); Shu Guanga
IsPeerReviewed	true
IsScholarly	true
Keywords	Conditional random field; scene classification; object classification; object localization
License	https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
URI	https://ieeexplore.ieee.org/document/7387771 https://www.proquest.com/docview/1787248208