Continual Image Captioning with Dynamic Token-Used Fusion Feature (应用动态Token的融合特征的持续图像字幕生成)
Published in | 计算机工程与应用 (Computer Engineering and Applications), Vol. 61, No. 4, pp. 176-191
Main Authors | JIN Jiali, YU Lu
Format | Journal Article |
Language | Chinese |
Published | 15.02.2025
ISSN | 1002-8331 |
DOI | 10.3778/j.issn.1002-8331.2309-0403 |
Abstract | Architectures based on self-attention mechanisms, such as the Transformer, exhibit outstanding performance in image captioning. In most approaches, however, models are trained only on static, identically distributed datasets, whereas real-world data mostly arrive as non-independent, non-identically distributed streams, which makes continual image captioning under this setting far more challenging. Continual learning for multi-modal tasks such as image captioning remains little studied, and continual image captioning methods suited to self-attention-based models are lacking. To address these challenges, a continual image captioning method with a dynamic Token-based fusion feature is proposed. Within the Transformer, the features of the different modalities involved in image captioning are fused, and regularization is applied to the fused features. A Token is defined for each subtask and changes as subtasks switch; this is called the dynamic Token. Compared with a static Token that is defined once for the whole training phase and shared by all subtasks, the dynamic Token better preserves the information and characteristics specific to each subtask. Using these dynamic task Tokens, a task-identity fusion-feature attention module further obtains fusion features carrying task-identity information; the Token for each subtask is saved after its training ends, maintaining the model's memory and expressive capability for previous tasks and reducing catastrophic forgetting. Experimental results on the MS-COCO and Flickr30k datasets show that the method outperforms all baseline methods on the Transformer architecture. For the CIDEr metric, for example, the average score after all training tasks is 31.06% higher than fine-tuning and 13.94% higher than the best-performing baseline method.
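The abstract describes the dynamic-Token mechanism in prose only. Below is a minimal, hypothetical PyTorch sketch of that idea as described: one learnable Token per subtask, prepended to the fused multi-modal features, and frozen once the subtask's training ends. All names (DynamicTokenPool, start_task, finish_task, d_model) are illustrative assumptions, not the authors' code; the task-identity fusion-feature attention module and the regularization term are omitted.

```python
import torch
import torch.nn as nn

class DynamicTokenPool(nn.Module):
    """Sketch: one learnable Token per subtask, prepended to fused features."""

    def __init__(self, d_model: int = 512):
        super().__init__()
        self.d_model = d_model
        self.tokens = nn.ParameterDict()  # task_id -> (1, 1, d_model) Token

    def start_task(self, task_id: str) -> None:
        # A fresh Token is created when a new subtask begins.
        if task_id not in self.tokens:
            tok = nn.Parameter(torch.empty(1, 1, self.d_model))
            nn.init.trunc_normal_(tok, std=0.02)
            self.tokens[task_id] = tok

    def finish_task(self, task_id: str) -> None:
        # After a subtask's training ends, its Token is frozen ("saved"),
        # keeping the model's memory of that task.
        self.tokens[task_id].requires_grad_(False)

    def forward(self, fused_feats: torch.Tensor, task_id: str) -> torch.Tensor:
        # fused_feats: (batch, seq_len, d_model) multi-modal fusion features.
        tok = self.tokens[task_id].expand(fused_feats.size(0), -1, -1)
        # Prepend the task Token so downstream self-attention can mix
        # task-identity information into the fusion features.
        return torch.cat([tok, fused_feats], dim=1)

# Usage sketch:
# pool = DynamicTokenPool(d_model=512)
# pool.start_task("task0")
# out = pool(torch.randn(8, 49, 512), "task0")  # -> (8, 50, 512)
# pool.finish_task("task0")
```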
Author | 晋嘉利 余璐 |
AuthorAffiliation | School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
Author_FL | JIN Jiali, YU Lu
ClassificationCodes | TP391 |
ContentType | Journal Article |
Copyright | Copyright © Wanfang Data Co. Ltd. All Rights Reserved. |
DatabaseName | Wanfang Data Journals; Wanfang Data Journals - Hong Kong (WANFANG Data Centre); China Online Journals (COJ)
Discipline | Engineering |
DocumentTitle_FL | Continual Image Captioning with Dynamic Token-Used Fusion Feature |
EndPage | 191 |
ExternalDocumentID | jsjgcyyy202504016 |
GrantInformation | National Natural Science Foundation of China; Tianjin University of Technology Graduate Research and Innovation Practice Project
IsPeerReviewed | false |
IsScholarly | false |
Issue | 4 |
Keywords | Transformer; continual learning; image captioning; dynamic Token; fusion feature; regularization
Language | Chinese |
PageCount | 16 |
PublicationDate | 2025-02-15 |
PublicationTitle | 计算机工程与应用 |
PublicationTitle_FL | Computer Engineering and Applications |
PublicationYear | 2025 |
StartPage | 176 |
Title | 应用动态Token的融合特征的持续图像字幕生成 |
URI | https://d.wanfangdata.com.cn/periodical/jsjgcyyy202504016 |
Volume | 61 |