Sparse Adversarial Examples Attacking on Video Captioning Model

Bibliographic Details
Published in: Ji suan ji ke xue (Computer Science), Vol. 50, No. 12, pp. 330-336
Main Authors: Qiu, Jiangxing; Tang, Xueming; Wang, Tianmei; Wang, Chen; Cui, Yongquan; Luo, Ting
Format: Journal Article
Language: Chinese
Published: Chongqing: Guojia Kexue Jishu Bu (Editorial Office of Computer Science), 01.12.2023
ISSN: 1002-137X
DOI: 10.11896/jsjkx.221100068

Summary: Although multi-modal deep learning models such as image captioning models have been shown to be vulnerable to adversarial examples, the adversarial susceptibility of video caption generation remains under-examined. There are two main reasons for this. On the one hand, in contrast to an image captioning system, the input of a video captioning model is a stream of images rather than a single picture, so the computation would be enormous if every frame of a video were perturbed. On the other hand, compared with a video recognition model, the output is not a single word but a more complex semantic description. To address these problems and study the robustness of video captioning models, this paper proposes a sparse adversarial attack method. First, drawing on the idea of saliency maps in image object recognition models, a method is proposed to measure the contribution of different frames to the output of the video captioning model, and an L-norm based optimization objective function suited to video captioning models is designed...
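The summary describes ranking frames by their contribution to the caption output (a saliency-map style analysis) and then perturbing only the most influential frames under a norm constraint. The following is a minimal, hypothetical sketch of that general idea, not the authors' implementation: it assumes a differentiable PyTorch captioning model `model(video, caption_tokens)` that returns a caption cross-entropy loss, and the function names, hyperparameters, and the L_inf budget (standing in for the paper's L-norm objective, whose exact form is not given in this record) are all illustrative assumptions.

```python
import torch

def frame_saliency(model, video, caption_tokens):
    """Score each frame by the gradient magnitude of the caption loss,
    analogous to saliency maps in image recognition (illustrative only)."""
    video = video.clone().requires_grad_(True)     # video: (T, C, H, W) in [0, 1]
    loss = model(video, caption_tokens)            # assumed: caption cross-entropy
    loss.backward()
    return video.grad.abs().flatten(1).sum(dim=1)  # one saliency score per frame

def sparse_attack(model, video, caption_tokens, k=3, eps=8 / 255, steps=10):
    """Perturb only the k most salient frames (PGD-style, L_inf budget)."""
    top_k = frame_saliency(model, video, caption_tokens).topk(k).indices
    mask = torch.zeros_like(video)
    mask[top_k] = 1.0                              # restrict changes to chosen frames
    adv = video.clone()
    for _ in range(steps):
        adv = adv.detach().requires_grad_(True)
        model(adv, caption_tokens).backward()      # ascend the caption loss
        with torch.no_grad():
            adv = adv + (eps / steps) * adv.grad.sign() * mask
            adv = video + (adv - video).clamp(-eps, eps)  # stay within the budget
            adv = adv.clamp(0.0, 1.0)
    return adv.detach(), top_k
```

Perturbing only the top-k frames is what makes the attack sparse: the cost of the gradient computation is paid once for frame selection, after which the optimization touches a small subset of the input stream.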