A New Memory Based on Sequence to Sequence Model for Video Captioning
Published in | 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), pp. 470 - 476 |
---|---|
Main Authors | Lin, Jin-Cheng; Zhang, Chun-Yang |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 18.06.2021 |
Subjects | Attention Module; Correlation; Deep learning; Feature extraction; NTM; Representation learning; Security; Sequence to Sequence; Supervised learning; Training; Video Captioning; Visualization |
DOI | 10.1109/SPAC53836.2021.9539903 |
Abstract | In the field of computer vision, video captioning is an important and meaningful task that automatically generates textual descriptions of video content. It is challenging because of the difficulty of understanding the objects and activities in a video. Benefiting from the rapid development of deep learning, e.g. the sequence-to-sequence model, video captioning has achieved highly accurate results. However, there are two serious flaws. First, pre-trained deep models are often used as visual feature extractors because training from scratch is highly time-consuming, so the generalization of the features produced by these pre-trained encoders is limited when the networks are employed directly in video captioning tasks. Second, each frame in the video is processed separately, which ignores the correlation of video data in the time dimension. In this work, we propose a video captioning model with an attention-memory module, built on the sequence-to-sequence framework, that captures temporal correlations; it demonstrates the importance of temporal structure to vision tasks by incorporating inter-frame correlation during feature extraction and enhancing the model's temporal memory. Our experiments use two of the most popular benchmark datasets in video captioning, MSVD and MSR-VTT, and we employ BLEU and METEOR to evaluate the accuracy of the descriptions generated by different methods. The experimental results confirm that the proposed model achieves significant improvements in description quality over the baseline models. |
---|---|
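The abstract reports results with the BLEU metric, which scores a candidate caption by its clipped n-gram overlap with a reference caption, combined via a geometric mean and a brevity penalty. A minimal, self-contained sketch of single-reference sentence BLEU is shown below; it is illustrative only and is not the paper's evaluation pipeline, which would typically use multi-reference corpus BLEU with smoothing (e.g. via NLTK or the official captioning toolkits).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: single reference, uniform
    weights, no smoothing (so any zero n-gram precision yields 0)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped ("modified") n-gram precision: a candidate n-gram
        # is credited at most as often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

reference = "a man is playing a guitar".split()
candidate = "a man is playing a guitar".split()
print(round(sentence_bleu(reference, candidate), 4))  # identical captions score 1.0
```

METEOR additionally aligns stems and synonyms and weights recall more heavily, which is why papers usually report both metrics side by side.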
Author | Lin, Jin-Cheng Zhang, Chun-Yang |
Author_xml | – sequence: 1 givenname: Jin-Cheng surname: Lin fullname: Lin, Jin-Cheng email: jinchengll@qq.com organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China – sequence: 2 givenname: Chun-Yang surname: Zhang fullname: Zhang, Chun-Yang email: zhangcy@fzu.edu.cn organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/SPAC53836.2021.9539903 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 1665443227 9781665443227 |
EndPage | 476 |
ExternalDocumentID | 9539903 |
Genre | orig-research |
GrantInformation_xml | – fundername: National Natural Science Foundation of China grantid: 62076065,61751202,61751205,61572540,U1813203,U1801262 funderid: 10.13039/501100001809 – fundername: Natural Science Foundation of Fujian Province grantid: 2020J01495 funderid: 10.13039/501100003392 |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:37:37 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
PageCount | 7 |
ParticipantIDs | ieee_primary_9539903 |
PublicationCentury | 2000 |
PublicationDate | 2021-June-18 |
PublicationDateYYYYMMDD | 2021-06-18 |
PublicationDate_xml | – month: 06 year: 2021 text: 2021-June-18 day: 18 |
PublicationDecade | 2020 |
PublicationTitle | 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC)
PublicationTitleAbbrev | SPAC |
PublicationYear | 2021 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 470 |
SubjectTerms | Attention Module Correlation Deep learning Feature extraction NTM Representation learning Security Sequence to Sequence Supervised learning Training Video Captioning Visualization |
Title | A New Memory Based on Sequence to Sequence Model for Video Captioning |
URI | https://ieeexplore.ieee.org/document/9539903 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |