Feedback LSTM network based on attention for image description generator

Bibliographic Details
Published in	Computers, Materials & Continua Vol. 59; no. 2; pp. 575-589
Main Authors	Qu, Zhaowei, Cao, Bingyu, Wang, Xiaoru, Li, Fu, Xu, Peirong, Zhang, Luhan
Format	Journal Article
Language	English
Published	Henderson: Tech Science Press, 2019
Subjects	Algorithms; Cider; Codec; Decoding; Feedback; Image coding; Image enhancement; Mapping; Modules; Multimedia; Semantics; Words (language)
ISSN	1546-2218, 1546-2226
DOI	10.32604/cmc.2019.05569

Summary:	Images are complex multimedia data that contain rich semantic information. Most current image description generation algorithms produce only plain descriptions that do not distinguish between primary and secondary objects, which leads to insufficient high-level semantics and limited accuracy under public evaluation criteria. The major issue is the lack of an effective network for generating high-level semantic sentences that describe the motion and state of the principal object in detail. To address this issue, this paper proposes the Attention-based Feedback Long Short-Term Memory Network (AFLN). Building on the existing codec framework, our method comprises two independent subtasks: an attention-based feedback LSTM network in the decoding phase and the Convolutional Block Attention Module (CBAM) in the encoding phase. First, we propose an attention-based network that feeds back the features corresponding to the word generated by the previous LSTM decoding unit. We implement feedback guidance through the related field mapping algorithm, which quantifies the correlation between the previous word and the next word, so that the main object can be tracked and described in highlighted detail. Second, we exploit the attention idea and apply CBAM, a lightweight and general module, after the last layer of the pretrained VGG16 network; it enhances the expressiveness of the image encoding features by combining channel and spatial attention maps with negligible overhead. Extensive experiments on the COCO dataset validate the superiority of our network over state-of-the-art algorithms, in both metric scores and qualitative results. The BLEU-4 score increases from 0.291 to 0.301, and the CIDEr score rises from 0.912 to 0.952.
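
The Summary describes CBAM as combining channel and spatial attention maps on top of the VGG16 image features. As a rough illustration only, the NumPy sketch below follows the general CBAM formulation (a shared two-layer MLP over average- and max-pooled channel descriptors, then a small convolution over channel-pooled spatial maps). It is not the authors' implementation: the function names, the 512 x 14 x 14 feature shape, the reduction ratio of 16, the 7 x 7 kernel, and the random weights are all assumptions made for this example.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def channel_attention(feat, w1, w2):
    """Per-channel weights in (0, 1); feat is (C, H, W), result is (C, 1, 1)."""
    avg = feat.mean(axis=(1, 2))  # average-pooled channel descriptor, shape (C,)
    mx = feat.max(axis=(1, 2))    # max-pooled channel descriptor, shape (C,)

    def mlp(v):
        # Shared two-layer MLP with a ReLU bottleneck of size C // r.
        return w2 @ np.maximum(w1 @ v, 0.0)

    return sigmoid(mlp(avg) + mlp(mx))[:, None, None]


def spatial_attention(feat, kernel):
    """Per-location weights in (0, 1); feat is (C, H, W), result is (1, H, W)."""
    # Pool across channels, then stack the two maps: shape (2, H, W).
    pooled = np.stack([feat.mean(axis=0), feat.max(axis=0)])
    k = kernel.shape[-1]
    pad = k // 2
    padded = np.pad(pooled, ((0, 0), (pad, pad), (pad, pad)))  # zero padding
    h, w = feat.shape[1:]
    out = np.empty((h, w))
    # Naive "same" cross-correlation with a (2, k, k) kernel.
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(padded[:, i:i + k, j:j + k] * kernel)
    return sigmoid(out)[None]


def cbam(feat, w1, w2, kernel):
    """Apply channel attention first, then spatial attention."""
    feat = feat * channel_attention(feat, w1, w2)
    return feat * spatial_attention(feat, kernel)


# Hypothetical stand-in for a VGG16 feature map; weights are random, not trained.
rng = np.random.default_rng(0)
C, H, W, r = 512, 14, 14, 16
feat = rng.standard_normal((C, H, W))
w1 = rng.standard_normal((C // r, C)) * 0.01
w2 = rng.standard_normal((C, C // r)) * 0.01
kernel = rng.standard_normal((2, 7, 7)) * 0.01

refined = cbam(feat, w1, w2, kernel)
print(refined.shape)  # same shape as the input feature map
```

Because both attention maps only rescale the input, the refined features keep the input shape, which is consistent with the Summary's description of CBAM as a lightweight module that can be attached after an existing backbone.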