A New Memory Based on Sequence to Sequence Model for Video Captioning

Bibliographic Details
Published in 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), pp. 470 - 476
Main Authors Lin, Jin-Cheng, Zhang, Chun-Yang
Format Conference Proceeding
Language English
Published IEEE 18.06.2021
Subjects
DOI 10.1109/SPAC53836.2021.9539903


Abstract In the field of computer vision, video captioning is an important and meaningful task: automatically generating textual descriptions of the contents of videos. It is a challenging problem because of the difficulty of understanding the objects and activities in a video. Benefiting from the rapid development of deep learning technology, e.g. the sequence-to-sequence model, video captioning has achieved very accurate results. However, there are two serious flaws. The first is that pre-trained deep models are often used as visual feature extractors, since training them from scratch is highly time-consuming; as a result, the generalization performance of the features produced by these pre-trained encoders is limited when the networks are employed directly in video captioning tasks. The second is that each frame in the video is processed separately, ignoring the correlation of video data in the time dimension. In this work, we propose a video captioning model with an attention-memory module, built on the sequence-to-sequence framework, to explore the role of capturing temporal correlations. It demonstrates the importance of temporal structure for vision tasks by incorporating the correlations among video frames during feature extraction and by enhancing the model's time-memory capability. Our experiments are based on the two most widely used benchmark datasets in video captioning, MSVD and MSR-VTT, and we employ BLEU and METEOR to evaluate the accuracy of the descriptions generated by different methods. The experimental results confirm that the proposed model achieves significant improvements in description quality compared with the baseline models.
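The abstract's attention-memory module attends over per-frame features to capture temporal correlations. The paper's exact architecture is not given in this record, so the following is only a minimal sketch of the generic mechanism it builds on: scaled dot-product attention of a decoder query over a sequence of frame feature vectors (all names, dimensions, and toy values here are hypothetical).

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def temporal_attention(query, frame_feats):
    """Attend over per-frame feature vectors (one per video frame).

    query       -- decoder state vector (hypothetical)
    frame_feats -- list of frame feature vectors, all the same length as query
    Returns (attention weights, weighted context vector).
    """
    d = len(query)
    # Scaled dot-product score for each frame.
    scores = [dot(query, f) / math.sqrt(d) for f in frame_feats]
    weights = softmax(scores)
    # Context vector: attention-weighted sum of the frame features.
    context = [sum(w * f[i] for w, f in zip(weights, frame_feats))
               for i in range(d)]
    return weights, context

# Toy example: 3 frames with 4-dim features; the query resembles frame 0.
frames = [[1.0, 0.0, 0.0, 0.0],
          [0.0, 1.0, 0.0, 0.0],
          [0.5, 0.5, 0.0, 0.0]]
q = [1.0, 0.0, 0.0, 0.0]
w, ctx = temporal_attention(q, frames)
```

Because the query is most similar to frame 0, it receives the largest attention weight; the decoder would consume `ctx` when emitting the next caption word. This is only the common baseline mechanism, not the authors' specific memory design.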
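The abstract evaluates generated captions with BLEU and METEOR. As a rough illustration of what such an n-gram overlap metric measures, here is a toy BLEU-1 (unigram precision with brevity penalty); real BLEU averages clipped precisions up to 4-grams, and METEOR additionally matches stems and synonyms, so this sketch understates both.

```python
import math
from collections import Counter

def bleu1(candidate, reference):
    """Toy BLEU-1: clipped unigram precision times a brevity penalty.

    candidate, reference -- whitespace-tokenized caption strings.
    (Real BLEU combines 1- to 4-gram precisions over multiple references.)
    """
    cand = candidate.split()
    ref = reference.split()
    cand_counts = Counter(cand)
    ref_counts = Counter(ref)
    # Clip each candidate word's count by its count in the reference.
    clipped = sum(min(c, ref_counts[w]) for w, c in cand_counts.items())
    precision = clipped / max(len(cand), 1)
    # Penalize candidates shorter than the reference.
    if len(cand) > len(ref):
        bp = 1.0
    else:
        bp = math.exp(1 - len(ref) / max(len(cand), 1))
    return bp * precision

# Hypothetical caption pair (not from the paper's datasets).
score = bleu1("a man is playing guitar", "a man is playing a guitar")
```

Every candidate word appears in the reference, so precision is 1.0, but the brevity penalty exp(1 - 6/5) lowers the score below 1. In practice one would use a tested implementation such as `nltk.translate.bleu_score` rather than this sketch.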
Author Lin, Jin-Cheng
Zhang, Chun-Yang
Author_xml – sequence: 1
  givenname: Jin-Cheng
  surname: Lin
  fullname: Lin, Jin-Cheng
  email: jinchengll@qq.com
  organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China
– sequence: 2
  givenname: Chun-Yang
  surname: Zhang
  fullname: Zhang, Chun-Yang
  email: zhangcy@fzu.edu.cn
  organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SPAC53836.2021.9539903
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 1665443227
9781665443227
EndPage 476
ExternalDocumentID 9539903
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  grantid: 62076065,61751202,61751205,61572540,U1813203,U1801262
  funderid: 10.13039/501100001809
– fundername: Natural Science Foundation of Fujian Province
  grantid: 2020J01495
  funderid: 10.13039/501100003392
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
IEDL.DBID RIE
IngestDate Thu Jun 29 18:37:37 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
PageCount 7
ParticipantIDs ieee_primary_9539903
PublicationCentury 2000
PublicationDate 2021-June-18
PublicationDateYYYYMMDD 2021-06-18
PublicationDate_xml – month: 06
  year: 2021
  text: 2021-June-18
  day: 18
PublicationDecade 2020
PublicationTitle 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC)
PublicationTitleAbbrev SPAC
PublicationYear 2021
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.8008591
SourceID ieee
SourceType Publisher
StartPage 470
SubjectTerms Attention Module
Correlation
Deep learning
Feature extraction
NTM
Representation learning
Security
Sequence to Sequence
Supervised learning
Training
Video Captioning
Visualization
Title A New Memory Based on Sequence to Sequence Model for Video Captioning
URI https://ieeexplore.ieee.org/document/9539903
hasFullText 1
inHoldings 1
isFullTextHit
isPrint