[1] ZHENG Q T, WANG Y P.Graph self-attention network for image captioning[C]//Proceedings of the 17th International Conference on Computer Systems and Applications.Washington D.C., USA:IEEE Press, 2020:1-8.
[2] XU X, WANG T, YANG Y, et al.Cross-modal attention with semantic consistence for image-text matching[J].IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(12):5412-5425.
[3] ANTOL S, AGRAWAL A, LU J, et al.VQA:visual question answering[C]//Proceedings of IEEE International Conference on Computer Vision.Santiago, Chile:IEEE Press, 2015:2425-2433.
[4] JIANG H, MISRA I, ROHRBACH M, et al.In defense of grid features for visual question answering[C]//Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:10264-10273.
[5] ANDERSON P, HE X, BUEHLER C, et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:6077-6086.
[6] ZHOU B, TIAN Y, SUKHBAATAR S, et al.Simple baseline for visual question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/1512.02167v2.
[7] FUKUI A, PARK D H, YANG D, et al.Multimodal compact bilinear pooling for visual question answering and visual grounding[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.Stroudsburg, USA:Association for Computational Linguistics, 2016:457-468.
[8] TENEY D, ANDERSON P, HE X, et al.Tips and tricks for visual question answering:learnings from the 2017 Challenge[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:4223-4232.
[9] BAI Y, FU J, ZHAO T, et al.Deep attention neural tensor network for visual question answering[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2018:21-37.
[10] CADENE R, BEN-YOUNES H, CORD M, et al.MUREL:multimodal relational reasoning for visual question answering[C]//Proceedings of 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:1989-1998.
[11] CHEN K, WANG J, CHEN L C, et al.ABC-CNN:an attention based convolutional neural network for visual question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/1511.05960.
[12] YANG Z C, HE X, GAO J, et al.Stacked attention networks for image question answering[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:21-29.
[13] YU D, FU J, TIAN X, et al.Multi-source multi-level attention networks for visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:4709-4717.
[14] REN S, HE K, GIRSHICK R, et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149.
[15] VELICKOVIC P, CASANOVA A, LIO P, et al.Graph attention networks[C]//Proceedings of the 6th International Conference on Learning Representations.Vancouver, Canada:[s.n.], 2018:1-12.
[16] VASWANI A, SHAZEER N, PARMAR N, et al.Attention is all you need[C]//Proceedings of NIPSʼ17.Cambridge, USA:MIT Press, 2017:5999-6009.
[17] MNIH V, HEESS N, GRAVES A, et al.Recurrent models of visual attention[C]//Proceedings of NIPSʼ14.Cambridge, USA:MIT Press, 2014:2204-2212.
[18] BAHDANAU D, CHO K H, BENGIO Y.Neural machine translation by jointly learning to align and translate[C]//Proceedings of the 3rd International Conference on Learning Representations.San Diego, USA:[s.n.], 2015:1-15.
[19] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al.Attention-based models for speech recognition[C]//Proceedings of NIPSʼ15.Cambridge, USA:MIT Press, 2015:577-585.
[20] YAN R Y, LIU X L.Visual question answering model based on bottom-up attention and memory network[J].Journal of Image and Graphics, 2020, 25(5):993-1006.(in Chinese)
[21] KIM J H, ON K W, LIM W, et al.Hadamard product for low-rank bilinear pooling[C]//Proceedings of the 5th International Conference on Learning Representations.Toulon, France:[s.n.], 2017:1-14.
[22] BEN-YOUNES H, CADENE R, CORD M, et al.MUTAN:multimodal tucker fusion for visual question answering[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:2631-2639.
[23] LU J, YANG J, BATRA D, et al.Hierarchical question-image co-attention for visual question answering[C]//Proceedings of the 30th Conference on Neural Information Processing Systems.Barcelona, Spain:[s.n.], 2016:289-297.
[24] YU Z, YU J, XIANG C, et al.Beyond bilinear:generalized multimodal factorized high-order pooling for visual question answering[J].IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12):5947-5959.
[25] KIM J H, JUN J, ZHANG B T.Bilinear attention networks[EB/OL].[2021-02-10].http://arxiv.org/abs/1805.07932v2.
[26] WANG P, WU Q, SHEN C, et al.The VQA-machine:learning how to use existing vision algorithms to answer new questions[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:3909-3918.
[27] BAI Y L.Research and application of image-text multimodal association learning[D].Harbin:Harbin Institute of Technology, 2018.(in Chinese)
[28] TANG K H, ZHANG H W, WU B Y, et al.Learning to compose dynamic tree structures for visual contexts[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6612-6621.
[29] HU Z, WEI J L, HUANG Q B, et al.Graph convolutional network for visual question answering based on fine-grained question representation[C]//Proceedings of the 5th IEEE International Conference on Data Science in Cyberspace.Washington D.C., USA:IEEE Press, 2020:218-224.
[30] LU P, JI L, ZHANG W, et al.R-VQA:learning visual relation facts with semantic attention for visual question answering[C]//Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Boston, USA:ACM Press, 2018:1880-1889.
[31] TENEY D, LIU L, VAN DEN HENGEL A.Graph-structured representations for visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:3233-3241.
[32] NORCLIFFE-BROWN W, VAFEIAS E, PARISOT S.Learning conditioned graph structures for interpretable visual question answering[C]//Proceedings of NIPSʼ18.Cambridge, USA:MIT Press, 2018:8334-8343.
[33] YANG Z Q, QIN Z, YU J, et al.Multi-modal learning with prior visual relation reasoning[EB/OL].[2021-02-10].http://arxiv.org/abs/1812.09681v1.
[34] LI L J, GAN Z, CHENG Y, et al.Relation-aware graph attention network for visual question answering[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2019:10312-10321.
[35] ZHU X, MAO Z D, CHEN Z N.Object-difference drived graph convolutional networks for visual question answering[J].Multimedia Tools and Applications, 2020, 12:1-19.
[36] YANG Z Q, QIN Z C, YU J, et al.Prior visual relationship reasoning for visual question answering[C]//Proceedings of 2020 IEEE International Conference on Image Processing.Washington D.C., USA:IEEE Press, 2020:1411-1415.
[37] CAO Q X, LIANG X D, WANG K Z, et al.Linguistically driven graph capsule network for visual question reasoning[EB/OL].[2021-02-10].http://arxiv.org/abs/2003.10065.
[38] HUANG D, CHEN P H, ZENG R H, et al.Location-aware graph convolutional networks for video question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/2008.09105.
[39] KIPF T N, WELLING M.Semi-supervised classification with graph convolutional networks[C]//Proceedings of the 5th International Conference on Learning Representations.Toulon, France:[s.n.], 2017:1-14.
[40] NARASIMHAN M, LAZEBNIK S, SCHWING A G.Out of the box:reasoning with graph convolution nets for factual visual question answering[EB/OL].[2021-02-10].https://arxiv.org/pdf/1811.00538.pdf.
[41] YU D F.Attention mechanism and high-level semantics for visual question answering[D].Hefei:University of Science and Technology of China, 2019.(in Chinese)
[42] KRISHNA R, ZHU Y, GROTH O, et al.Visual genome:connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision, 2017, 123(1):32-73.
[43] PENNINGTON J, SOCHER R, MANNING C D.GloVe:global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.Doha, Qatar:Association for Computational Linguistics, 2014:1532-1543.
[44] BA J L, KIROS J R, HINTON G E.Layer normalization[EB/OL].[2021-02-10].http://arxiv.org/abs/1607.06450.
[45] GOYAL Y, KHOT T, AGRAWAL A, et al.Making the V in VQA matter:elevating the role of image understanding in visual question answering[J].International Journal of Computer Vision, 2019, 127(4):398-414.
[46] AGRAWAL A, BATRA D, PARIKH D, et al.Don't just assume; look and answer:overcoming priors for visual question answering[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Salt Lake City, USA:IEEE Press, 2018:4971-4980.
[47] LIN T Y, MAIRE M, BELONGIE S, et al.Microsoft COCO:common objects in context[C]//Proceedings of European Conference on Computer Vision.Zurich, Switzerland:Springer, 2014:740-755.
[48] GOYAL P, DOLLAR P, GIRSHICK R, et al.Accurate, large minibatch SGD:training ImageNet in 1 hour[EB/OL].[2021-02-10].http://arxiv.org/abs/1706.02677.
[49] MALINOWSKI M, DOERSCH C, SANTORO A, et al.Learning visual question answering by bootstrapping hard attention[C]//Proceedings of European Conference on Computer Vision.Munich, Germany:Springer, 2018:3-20.