[1] ANTOL S,AGRAWAL A,LU J,et al.VQA:visual question answering[J].International Journal of Computer Vision,2017,123(1):4-31.
[2] TURNEY P,PANTEL P.From frequency to meaning:vector space models of semantics[J].Journal of Artificial Intelligence Research,2010,37(1):141-188.
[3] MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[EB/OL].[2019-11-10].http://export.arxiv.org/pdf/1301.3781.
[4] DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[EB/OL].[2019-11-10].https://tooob.com/api/objs/read/noteid/28717995/.
[5] YANG Zhilin,DAI Zihang,YANG Yiming,et al.XLNet:generalized autoregressive pretraining for language understanding[EB/OL].[2019-11-10].https://arxiv.org/abs/1906.08237.
[6] ZHOU B L,TIAN Y D,SUKHBAATAR S,et al.Simple baseline for visual question answering[EB/OL].[2019-11-10].http://de.arxiv.org/pdf/1512.02167.
[7] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-11-10].https://arxiv.org/abs/1409.1556.
[8] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[9] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:common objects in context[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2014:740-755.
[10] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[EB/OL].[2019-11-10].https://arxiv.org/abs/1706.03762.
[11] XU H,SAENKO K.Ask,attend and answer:exploring question-guided spatial attention for visual question answering[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2016:156-163.
[12] ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:6077-6086.
[13] REN S,HE K,GIRSHICK R,et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(6):1137-1149.
[14] JANG Y,SONG Y L,YU Y,et al.TGIF-QA:toward spatio-temporal reasoning in visual question answering[EB/OL].[2019-11-10].https://arxiv.org/pdf/1704.04497.pdf.
[15] TRAN D,BOURDEV L,FERGUS R,et al.Learning spatiotemporal features with 3D convolutional networks[EB/OL].[2019-11-10].https://arxiv.org/abs/1412.0767.
[16] HE Kaiming,ZHANG Xiangyu,REN Shaoqing,et al.Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:770-778.
[17] MU J Q,BHAT S M,VISWANATH P.All-but-the-top:simple and effective postprocessing for word representations[EB/OL].[2019-11-10].https://arxiv.org/abs/1702.01417.
[18] DENG J,DONG W,SOCHER R,et al.ImageNet:a large-scale hierarchical image database[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2009:45-69.
[19] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-11-10].https://arxiv.org/abs/1409.1556.
[20] SZEGEDY C,LIU W,JIA Y,et al.Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:12-26.
[21] CHUNG J,GULCEHRE C,CHO K,et al.Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL].[2019-11-10].https://arxiv.org/abs/1412.3555.
[22] CHU W,XUE H,ZHAO Z,et al.The forgettable-watcher model for video question answering[J].Neurocomputing,2018,314:386-393.
[23] WANG Bo,XU Youjiang,HAN Yahong,et al.Movie question answering:remembering the textual cues for layered visual contents[EB/OL].[2019-11-10].https://arxiv.org/pdf/1804.09412.pdf.
[24] LEI J,YU L,BANSAL M,et al.TVQA:localized,compositional video question answering[EB/OL].[2019-11-10].https://www.aclweb.org/anthology/D18-1167.pdf.
[25] ZHANG Jing,CHEN Qingkui.Analysis of crowd congestion degree in narrow space based on attention mechanism[J].Computer Engineering,2020,46(9):254-260,267.(in Chinese)张菁,陈庆奎.基于注意力机制的狭小空间人群拥挤度分析[J].计算机工程,2020,46(9):254-260,267.
[26] LI Yachao,XIONG Deyi,ZHANG Min.A survey of neural machine translation[J].Chinese Journal of Computers,2018,41(12):2734-2755.(in Chinese)李亚超,熊德意,张民.神经机器翻译综述[J].计算机学报,2018,41(12):2734-2755.
[27] YU Y,KIM J,KIM G.A joint sequence fusion model for video question answering and retrieval[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2018:471-487.
[28] YE Yunan,ZHAO Zhou,LI Yimeng,et al.Video question answering via attribute-augmented attention network learning[C]//Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2017:829-832.
[29] XU Dejing,ZHAO Zhou,XIAO Jun,et al.Video question answering via gradually refined attention over appearance and motion[C]//Proceedings of the 25th ACM International Conference on Multimedia.New York,USA:ACM Press,2017:1645-1653.
[30] LIANG Lili.Research on video question answering based on deep learning method[D].Harbin:Harbin University of Science and Technology,2019.(in Chinese)梁丽丽.基于深度学习方法的视频问答研究[D].哈尔滨:哈尔滨理工大学,2019.
[31] YAO L,TORABI A,CHO K,et al.Describing videos by exploiting temporal structure[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2015:4507-4515.
[32] DONAHUE J,HENDRICKS L A,ROHRBACH M,et al.Long-term recurrent convolutional networks for visual recognition and description[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(4):677-691.
[33] SUN C,MYERS A,VONDRICK C,et al.VideoBERT:a joint model for video and language representation learning[EB/OL].[2019-11-10].https://arxiv.org/pdf/1904.01766.pdf.