A New Memory Based on Sequence to Sequence Model for Video Captioning
Published in | 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC), pp. 470 - 476 |
---|---|
Main Authors | Lin, Jin-Cheng; Zhang, Chun-Yang |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 18.06.2021 |
Subjects | Attention Module; Correlation; Deep learning; Feature extraction; NTM; Representation learning; Security; Sequence to Sequence; Supervised learning; Training; Video Captioning; Visualization |
DOI | 10.1109/SPAC53836.2021.9539903 |
Abstract | In the field of computer vision, video captioning is an important and meaningful task that automatically generates textual descriptions of video content. It is challenging because of the difficulty of understanding the objects and activities in a video. Benefiting from the rapid development of deep learning, e.g. the sequence-to-sequence model, video captioning has achieved highly accurate results. However, there are two serious flaws. First, pre-trained deep models are often used as visual feature extractors because training from scratch is highly time-consuming, so the generalization of the features produced by these pre-trained encoders is limited when the networks are employed directly in video captioning tasks. Second, each frame in the video is processed separately, which ignores the correlation of video data in the time dimension. In this work, we propose a video captioning model with an attention-memory module, built on the sequence-to-sequence framework, that captures temporal correlations; it demonstrates the importance of temporal structure to vision tasks by incorporating inter-frame correlation during feature extraction and enhancing the model's temporal memory. Our experiments use two of the most popular benchmark datasets in video captioning, MSVD and MSR-VTT, and we employ BLEU and METEOR to evaluate the accuracy of the descriptions generated by different methods. The experimental results confirm that the proposed model achieves significant improvements in description quality over the baseline models. |
---|---|
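The abstract reports results with the BLEU metric, which scores a candidate caption by its clipped n-gram overlap with a reference caption, combined via a geometric mean and a brevity penalty. A minimal, self-contained sketch of single-reference sentence BLEU is shown below; it is illustrative only and is not the paper's evaluation pipeline, which would typically use multi-reference corpus BLEU with smoothing (e.g. via NLTK or the official captioning toolkits).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def sentence_bleu(reference, candidate, max_n=4):
    """Simplified sentence-level BLEU: single reference, uniform
    weights, no smoothing (so any zero n-gram precision yields 0)."""
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(candidate, n))
        ref_counts = Counter(ngrams(reference, n))
        # Clipped ("modified") n-gram precision: a candidate n-gram
        # is credited at most as often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        precisions.append(overlap / total)
    if min(precisions) == 0:
        return 0.0
    # Geometric mean of the n-gram precisions.
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty discourages candidates shorter than the reference.
    bp = min(1.0, math.exp(1 - len(reference) / len(candidate)))
    return bp * math.exp(log_avg)

reference = "a man is playing a guitar".split()
candidate = "a man is playing a guitar".split()
print(round(sentence_bleu(reference, candidate), 4))  # identical captions score 1.0
```

METEOR additionally aligns stems and synonyms and weights recall more heavily, which is why papers usually report both metrics side by side.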
Author | Lin, Jin-Cheng Zhang, Chun-Yang |
Author_xml | – sequence: 1 givenname: Jin-Cheng surname: Lin fullname: Lin, Jin-Cheng email: jinchengll@qq.com organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China – sequence: 2 givenname: Chun-Yang surname: Zhang fullname: Zhang, Chun-Yang email: zhangcy@fzu.edu.cn organization: Fuzhou University,College of Mathematics and Computer Science,Fuzhou,China |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/SPAC53836.2021.9539903 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 1665443227 9781665443227 |
EndPage | 476 |
ExternalDocumentID | 9539903 |
Genre | orig-research |
GrantInformation_xml | – fundername: National Natural Science Foundation of China grantid: 62076065,61751202,61751205,61572540,U1813203,U1801262 funderid: 10.13039/501100001809 – fundername: Natural Science Foundation of Fujian Province grantid: 2020J01495 funderid: 10.13039/501100003392 |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:37:37 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
PageCount | 7 |
ParticipantIDs | ieee_primary_9539903 |
PublicationCentury | 2000 |
PublicationDate | 2021-June-18 |
PublicationDateYYYYMMDD | 2021-06-18 |
PublicationDate_xml | – month: 06 year: 2021 text: 2021-June-18 day: 18 |
PublicationDecade | 2020 |
PublicationTitle | 2021 International Conference on Security, Pattern Analysis, and Cybernetics (SPAC)
PublicationTitleAbbrev | SPAC |
PublicationYear | 2021 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SourceID | ieee |
SourceType | Publisher |
StartPage | 470 |
SubjectTerms | Attention Module Correlation Deep learning Feature extraction NTM Representation learning Security Sequence to Sequence Supervised learning Training Video Captioning Visualization |
Title | A New Memory Based on Sequence to Sequence Model for Video Captioning |
URI | https://ieeexplore.ieee.org/document/9539903 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |