Continual Image Captioning with Dynamic Token-Based Fusion Features

TP391; Self-attention-based architectures such as the Transformer show outstanding performance in image captioning. In most methods, however, the model is trained only on static, identically distributed datasets, whereas real-world data mostly arrive as non-independent, non-identically distributed streams, which makes continual image captioning under this setting more challenging. Continual learning for multimodal tasks such as image captioning has so far received little study, and continual image captioning methods well suited to self-attention-based models are lacking. To address these challenges, a continual image captioning method with dynamic Token-based fusion features is proposed. Within the Transformer, the data features of the different modalities involved in image captioning are fused, and a regularization term is computed over the fused features; a Token is defined for each sub-task, and the Token changes as sub-tasks are switched...

Bibliographic Details
Published in 计算机工程与应用 (Computer Engineering and Applications) Vol. 61; no. 4; pp. 176-191
Main Authors 晋嘉利, 余璐
Format Journal Article
Language Chinese
Published School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China, 15.02.2025
Subjects
Online Access Get full text
ISSN 1002-8331
DOI 10.3778/j.issn.1002-8331.2309-0403


Abstract TP391; Self-attention-based architectures such as the Transformer show outstanding performance in image captioning. In most methods, however, the model is trained only on static, identically distributed datasets, whereas real-world data mostly arrive as non-independent, non-identically distributed streams, which makes continual image captioning under this setting more challenging. Continual learning for multimodal tasks such as image captioning has so far received little study, and continual image captioning methods well suited to self-attention-based models are lacking. To address these challenges, a continual image captioning method with dynamic Token-based fusion features is proposed. Within the Transformer, the data features of the different modalities involved in image captioning are fused, and a regularization term is computed over the fused features. A Token is defined for each sub-task and changes as sub-tasks are switched; such a Token is the dynamic Token. Compared with a static Token that is defined only once for the whole training phase and shared by all sub-tasks, the dynamic Token better preserves the information and characteristics specific to each sub-task. Using these dynamic task Tokens, a task-identity fusion-feature attention module further obtains fusion features carrying task-identity information, and the corresponding Token is saved after each sub-task's training ends, so that the model retains its memory of and expressive ability for old tasks and catastrophic forgetting is reduced. Experimental results on the MS-COCO and Flickr30k datasets show that the method outperforms all baseline methods on the Transformer architecture. Taking the CIDEr metric as an example, the average CIDEr score after all training tasks are completed is 31.06% higher than fine-tuning and 13.94% higher than the best of all baseline methods.
Abstract_FL Architectures based on self-attention mechanisms, such as the Transformer, exhibit outstanding performance advantages in image captioning tasks. However, in the majority of these approaches, models are only trained on static and identically distributed datasets. In reality, data distributions are mostly non-independent and non-identically distributed data streams, making the task of continual image captioning under such settings more challenging. Notably, there is limited research on continual learning for multi-modal tasks like image captioning, and there is a lack of continual image captioning methods that are well suited for self-attention-based models. To address these challenges, a continual image captioning method with dynamic Token-based fusion features is proposed. The features of the different data modalities involved in image captioning tasks are fused within the Transformer, and regularization is applied to the fusion features. A Token is designated for each sub-task and changes as sub-tasks are switched; this is called the dynamic Token. Compared with a static Token that is defined only once for the whole training phase and shared by all sub-tasks, the dynamic Token better preserves the information and characteristics specific to each sub-task. Using these dynamic task Tokens, a task-identity fusion-feature attention module further obtains fusion features carrying task-identity information, and the corresponding Token is saved after the training of each sub-task to maintain the model's memory and expressive capability for previous tasks and to reduce catastrophic forgetting. Experimental results on the MS-COCO and Flickr30k datasets demonstrate that the continual image captioning method with dynamic Token-based fusion features outperforms all baseline methods on the Transformer architecture. Taking the CIDEr metric as an example, the average CIDEr score after completing all training tasks is increased by 31.06% compared to fine-tuning and by 13.94% compared to the best-performing baseline method.
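The dynamic-Token mechanism summarized in the abstract can be made concrete with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions, not the authors' implementation: the class name DynamicTaskTokens, the function fusion_feature_penalty, the tensor shapes, and the choice to prepend the Token to the fused features are all hypothetical, and the paper's actual task-identity fusion-feature attention module and regularizer may differ.

```python
import torch
import torch.nn as nn


class DynamicTaskTokens(nn.Module):
    """Sketch of per-sub-task ("dynamic") Tokens.

    One learnable d_model-dimensional Token is kept per sub-task; the
    current sub-task's Token is prepended to the fused image-text
    features so that self-attention can mix task-identity information
    into them. Names and shapes are illustrative assumptions.
    """

    def __init__(self, num_tasks: int, d_model: int):
        super().__init__()
        self.tokens = nn.Parameter(torch.empty(num_tasks, d_model))
        nn.init.normal_(self.tokens, std=0.02)
        self.saved = {}  # detached Token copies for finished sub-tasks

    def forward(self, fused: torch.Tensor, task_id: int) -> torch.Tensor:
        # fused: (batch, seq_len, d_model) multimodal fusion features.
        tok = self.tokens[task_id].expand(fused.size(0), 1, -1)
        # Result: (batch, seq_len + 1, d_model) with the Token prepended.
        return torch.cat([tok, fused], dim=1)

    def save_task(self, task_id: int) -> None:
        # The abstract's "save its corresponding Token" step: keep a
        # frozen copy once the sub-task finishes training, so old-task
        # inference can still use it.
        self.saved[task_id] = self.tokens[task_id].detach().clone()


def fusion_feature_penalty(fused: torch.Tensor,
                           fused_old: torch.Tensor) -> torch.Tensor:
    # Placeholder for the "regularization over the fused features": an
    # L2 feature-distillation term against the previous model's fused
    # features. The record does not specify the actual regularizer, so
    # this particular form is only an assumption.
    return (fused - fused_old.detach()).pow(2).mean()
```

A hypothetical usage, with made-up sizes:

```python
pool = DynamicTaskTokens(num_tasks=4, d_model=512)
fused = torch.randn(8, 50, 512)   # stand-in for fused region/word features
x = pool(fused, task_id=0)        # (8, 51, 512): task Token prepended
pool.save_task(0)                 # after sub-task 0's training ends
```

Storing one small vector per finished sub-task keeps the memory overhead modest compared with rehearsal buffers, which is consistent with the abstract's point that the saved Tokens alone preserve each old task's identity information.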
Author 晋嘉利
余璐
AuthorAffiliation School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
AuthorAffiliation_xml – name: School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
Author_FL YU Lu
JIN Jiali
Author_FL_xml – sequence: 1
  fullname: JIN Jiali
– sequence: 2
  fullname: YU Lu
Author_xml – sequence: 1
  fullname: 晋嘉利
– sequence: 2
  fullname: 余璐
ClassificationCodes TP391
ContentType Journal Article
Copyright Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
Copyright_xml – notice: Copyright © Wanfang Data Co. Ltd. All Rights Reserved.
DBID 2B.
4A8
92I
93N
PSX
TCJ
DOI 10.3778/j.issn.1002-8331.2309-0403
DatabaseName Wanfang Data Journals - Hong Kong
WANFANG Data Centre
Wanfang Data Journals
China Online Journals (COJ)
DatabaseTitleList
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
DocumentTitle_FL Continual Image Captioning with Dynamic Token-Based Fusion Features
EndPage 191
ExternalDocumentID jsjgcyyy202504016
GrantInformation_xml – fundername: National Natural Science Foundation of China; Tianjin University of Technology University-Level Postgraduate Research and Innovation Practice Program
  funderid: National Natural Science Foundation of China; Tianjin University of Technology University-Level Postgraduate Research and Innovation Practice Program
GroupedDBID -0Y
2B.
4A8
5XA
5XJ
92H
92I
93N
ABJNI
ACGFS
ALMA_UNASSIGNED_HOLDINGS
CCEZO
CUBFJ
CW9
PSX
TCJ
TGT
U1G
U5S
ISSN 1002-8331
IngestDate Thu May 29 04:10:55 EDT 2025
IsPeerReviewed false
IsScholarly false
Issue 4
Keywords Transformer
continual learning
fusion feature
image captioning
dynamic Token
regularization
Language Chinese
LinkModel OpenURL
PageCount 16
ParticipantIDs wanfang_journals_jsjgcyyy202504016
PublicationCentury 2000
PublicationDate 2025-02-15
PublicationDateYYYYMMDD 2025-02-15
PublicationDate_xml – month: 02
  year: 2025
  text: 2025-02-15
  day: 15
PublicationDecade 2020
PublicationTitle 计算机工程与应用
PublicationTitle_FL Computer Engineering and Applications
PublicationYear 2025
Publisher School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
Publisher_xml – name: School of Computer Science and Engineering, Tianjin University of Technology, Tianjin 300384, China
SSID ssib051375739
ssib001102935
ssj0000561668
ssib023646291
ssib057620132
Snippet TP391; Self-attention-based architectures such as the Transformer show outstanding performance in image captioning. In most methods, however, the model is trained only on static, identically distributed datasets, whereas real-world data distributions are mostly...
SourceID wanfang
SourceType Aggregation Database
StartPage 176
Title Continual Image Captioning with Dynamic Token-Based Fusion Features
URI https://d.wanfangdata.com.cn/periodical/jsjgcyyy202504016
Volume 61
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVEBS
  databaseName: Inspec with Full Text
  issn: 1002-8331
  databaseCode: ADMLS
  dateStart: 20200501
  customDbUrl:
  isFulltext: true
  dateEnd: 99991231
  titleUrlDefault: https://www.ebsco.com/products/research-databases/inspec-full-text
  omitProxy: false
  ssIdentifier: ssib057620132
  providerName: EBSCOhost
linkProvider EBSCOhost