[1] ANTOL S,AGRAWAL A,LU J,et al.VQA:visual question answering[J].International Journal of Computer Vision,2017,123(1):4-31.
[2] TURNEY P,PANTEL P.From frequency to meaning:vector space models of semantics[J].Journal of Artificial Intelligence Research,2010,37(1):141-188.
[3] MIKOLOV T,CHEN K,CORRADO G,et al.Efficient estimation of word representations in vector space[EB/OL].[2019-11-10].http://export.arxiv.org/pdf/1301.3781.
[4] DEVLIN J,CHANG M W,LEE K,et al.BERT:pre-training of deep bidirectional transformers for language understanding[EB/OL].[2019-11-10].https://tooob.com/api/objs/read/noteid/28717995/.
[5] YANG Zhilin,DAI Zihang,YANG Yiming,et al.XLNet:generalized autoregressive pretraining for language understanding[EB/OL].[2019-11-10].https://arxiv.org/abs/1906.08237.
[6] ZHOU B L,TIAN Y D,SUKHBAATAR S,et al.Simple baseline for visual question answering[EB/OL].[2019-11-10].http://de.arxiv.org/pdf/1512.02167.
[7] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-11-10].https://arxiv.org/abs/1409.1556.
[8] HOCHREITER S,SCHMIDHUBER J.Long short-term memory[J].Neural Computation,1997,9(8):1735-1780.
[9] LIN T Y,MAIRE M,BELONGIE S,et al.Microsoft COCO:common objects in context[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2014:740-755.
[10] VASWANI A,SHAZEER N,PARMAR N,et al.Attention is all you need[EB/OL].[2019-11-10].https://arxiv.org/abs/1706.03762.
[11] XU H,SAENKO K.Ask,attend and answer:exploring question-guided spatial attention for visual question answering[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2016:156-163.
[12] ANDERSON P,HE X,BUEHLER C,et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:6077-6086.
[13] REN S,HE K,GIRSHICK R,et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(6):1137-1149.
[14] JANG Y,SONG Y L,YU Y,et al.TGIF-QA:toward spatio-temporal reasoning in visual question answering[EB/OL].[2019-11-10].https://arxiv.org/pdf/1704.04497.pdf.
[15] TRAN D,BOURDEV L,FERGUS R,et al.Learning spatiotemporal features with 3D convolutional networks[EB/OL].[2019-11-10].https://arxiv.org/abs/1412.0767.
[16] HE Kaiming,ZHANG Xiangyu,REN Shaoqing,et al.Deep residual learning for image recognition[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:770-778.
[17] MU J Q,BHAT S M,VISWANATH P.All-but-the-top:simple and effective postprocessing for word representations[EB/OL].[2019-11-10].https://arxiv.org/abs/1702.01417.
[18] DENG J,DONG W,SOCHER R,et al.ImageNet:a large-scale hierarchical image database[C]//Proceedings of 2009 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2009:45-69.
[19] SIMONYAN K,ZISSERMAN A.Very deep convolutional networks for large-scale image recognition[EB/OL].[2019-11-10].https://arxiv.org/abs/1409.1556.
[20] SZEGEDY C,LIU W,JIA Y,et al.Going deeper with convolutions[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:12-26.
[21] CHUNG J,GULCEHRE C,CHO K,et al.Empirical evaluation of gated recurrent neural networks on sequence modeling[EB/OL].[2019-11-10].https://arxiv.org/abs/1412.3555.
[22] CHU W,XUE H,ZHAO Z,et al.The forgettable-watcher model for video question answering[J].Neurocomputing,2018,314:386-393.
[23] WANG Bo,XU Youjiang,HAN Yahong,et al.Movie question answering:remembering the textual cues for layered visual contents[EB/OL].[2019-11-10].https://arxiv.org/pdf/1804.09412.pdf.
[24] LEI J,YU L,BANSAL M,et al.TVQA:localized,compositional video question answering[EB/OL].[2019-11-10].https://www.aclweb.org/anthology/D18-1167.pdf.
[25] ZHANG Jing,CHEN Qingkui.Analysis of crowd congestion degree in narrow space based on attention mechanism[J].Computer Engineering,2020,46(9):254-260,267.(in Chinese)张菁,陈庆奎.基于注意力机制的狭小空间人群拥挤度分析[J].计算机工程,2020,46(9):254-260,267.
[26] LI Yachao,XIONG Deyi,ZHANG Min.A survey of neural machine translation[J].Chinese Journal of Computers,2018,41(12):2734-2755.(in Chinese)李亚超,熊德意,张民.神经机器翻译综述[J].计算机学报,2018,41(12):2734-2755.
[27] YU Y,KIM J,KIM G.A joint sequence fusion model for video question answering and retrieval[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2018:471-487.
[28] YE Yunan,ZHAO Zhou,LI Yimeng,et al.Video question answering via attribute-augmented attention network learning[C]//Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.New York,USA:ACM Press,2017:829-832.
[29] XU Dejing,ZHAO Zhou,XIAO Jun,et al.Video question answering via gradually refined attention over appearance and motion[C]//Proceedings of the 25th ACM International Conference on Multimedia.New York,USA:ACM Press,2017:1645-1653.
[30] LIANG Lili.Research on video question answering based on deep learning method[D].Harbin:Harbin University of Science and Technology,2019.(in Chinese)梁丽丽.基于深度学习方法的视频问答研究[D].哈尔滨:哈尔滨理工大学,2019.
[31] YAO L,TORABI A,CHO K,et al.Describing videos by exploiting temporal structure[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2015:4507-4515.
[32] DONAHUE J,HENDRICKS L A,ROHRBACH M,et al.Long-term recurrent convolutional networks for visual recognition and description[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(4):677-691.
[33] SUN C,MYERS A,VONDRICK C,et al.VideoBERT:a joint model for video and language representation learning[EB/OL].[2019-11-10].https://arxiv.org/pdf/1904.01766.pdf.