[1] ZHENG Q T, WANG Y P.Graph self-attention network for image captioning[C]//Proceedings of the 17th International Conference on Computer Systems and Applications.Washington D.C., USA:IEEE Press, 2020:1-8.
[2] XU X, WANG T, YANG Y, et al.Cross-modal attention with semantic consistence for image-text matching[J].IEEE Transactions on Neural Networks and Learning Systems, 2020, 31(12):5412-5425.
[3] ANTOL S, AGRAWAL A, LU J, et al.VQA:visual question answering[C]//Proceedings of IEEE International Conference on Computer Vision.Santiago, Chile:IEEE Press, 2015:2425-2433.
[4] JIANG H, MISRA I, ROHRBACH M, et al.In defense of grid features for visual question answering[C]//Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:10264-10273.
[5] ANDERSON P, HE X, BUEHLER C, et al.Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:6077-6086.
[6] ZHOU B, TIAN Y, SUKHBAATAR S, et al.Simple baseline for visual question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/1512.02167v2.
[7] FUKUI A, PARK D H, YANG D, et al.Multimodal compact bilinear pooling for visual question answering and visual grounding[C]//Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing.Stroudsburg, USA:Association for Computational Linguistics, 2016:457-468.
[8] TENEY D, ANDERSON P, HE X, et al.Tips and tricks for visual question answering:learnings from the 2017 Challenge[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:4223-4232.
[9] BAI Y, FU J, ZHAO T, et al.Deep attention neural tensor network for visual question answering[C]//Proceedings of European Conference on Computer Vision.Berlin, Germany:Springer, 2018:21-37.
[10] CADENE R, BEN-YOUNES H, CORD M, et al.MUREL:multimodal relational reasoning for visual question answering[C]//Proceedings of 2019 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:1989-1998.
[11] CHEN K, WANG J, CHEN L C, et al.ABC-CNN:an attention based convolutional neural network for visual question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/1511.05960.
[12] YANG Z C, HE X, GAO J, et al.Stacked attention networks for image question answering[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2016:21-29.
[13] YU D, FU J, TIAN X, et al.Multi-source multi-level attention networks for visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:4709-4717.
[14] REN S, HE K, GIRSHICK R, et al.Faster R-CNN:towards real-time object detection with region proposal networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6):1137-1149.
[15] VELICKOVIC P, CASANOVA A, LIO P, et al.Graph attention networks[C]//Proceedings of the 6th International Conference on Learning Representations.Vancouver, Canada:[s.n.], 2018:1-12.
[16] VASWANI A, SHAZEER N, PARMAR N, et al.Attention is all you need[C]//Proceedings of NIPSʼ17.Cambridge, USA:MIT Press, 2017:5999-6009.
[17] MNIH V, HEESS N, GRAVES A, et al.Recurrent models of visual attention[C]//Proceedings of NIPSʼ14.Cambridge, USA:MIT Press, 2014:2204-2212.
[18] BAHDANAU D, CHO K H, BENGIO Y.Neural machine translation by jointly learning to align and translate[C]//Proceedings of the 3rd International Conference on Learning Representations.San Diego, USA:[s.n.], 2015:1-15.
[19] CHOROWSKI J K, BAHDANAU D, SERDYUK D, et al.Attention-based models for speech recognition[C]//Proceedings of NIPSʼ15.Cambridge, USA:MIT Press, 2015:577-585.
[20] YAN R Y, LIU X L.Visual question answering model based on bottom-up attention and memory network[J].Journal of Image and Graphics, 2020, 25(5):993-1006.(in Chinese)
[21] KIM J H, ON K W, LIM W, et al.Hadamard product for low-rank bilinear pooling[C]//Proceedings of the 5th International Conference on Learning Representations.Toulon, France:[s.n.], 2017:1-14.
[22] BEN-YOUNES H, CADENE R, CORD M, et al.MUTAN:multimodal tucker fusion for visual question answering[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:2631-2639.
[23] LU J, YANG J, BATRA D, et al.Hierarchical question-image co-attention for visual question answering[C]//Proceedings of the 30th Conference on Neural Information Processing Systems.Barcelona, Spain:[s.n.], 2016:289-297.
[24] YU Z, YU J, XIANG C, et al.Beyond bilinear:generalized multimodal factorized high-order pooling for visual question answering[J].IEEE Transactions on Neural Networks and Learning Systems, 2018, 29(12):5947-5959.
[25] KIM J H, JUN J, ZHANG B T.Bilinear attention networks[EB/OL].[2021-02-10].http://arxiv.org/abs/1805.07932v2.
[26] WANG P, WU Q, SHEN C, et al.The VQA-machine:learning how to use existing vision algorithms to answer new questions[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:3909-3918.
[27] BAI Y L.Research and application of image-text multimodal association learning[D].Harbin:Harbin Institute of Technology, 2018.(in Chinese)
[28] TANG K H, ZHANG H W, WU B Y, et al.Learning to compose dynamic tree structures for visual contexts[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2019:6612-6621.
[29] HU Z, WEI J L, HUANG Q B, et al.Graph convolutional network for visual question answering based on fine-grained question representation[C]//Proceedings of the 5th IEEE International Conference on Data Science in Cyberspace.Washington D.C., USA:IEEE Press, 2020:218-224.
[30] LU P, JI L, ZHANG W, et al.R-VQA:learning visual relation facts with semantic attention for visual question answering[C]//Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Boston, USA:ACM Press, 2018:1880-1889.
[31] TENEY D, LIU L, VAN DEN HENGEL A.Graph-structured representations for visual question answering[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu, USA:IEEE Press, 2017:3233-3241.
[32] NORCLIFFE-BROWN W, VAFEIAS E, PARISOT S.Learning conditioned graph structures for interpretable visual question answering[C]//Proceedings of NIPSʼ18.Cambridge, USA:MIT Press, 2018:8334-8343.
[33] YANG Z Q, QIN Z, YU J, et al.Multi-modal learning with prior visual relation reasoning[EB/OL].[2021-02-10].http://arxiv.org/abs/1812.09681v1.
[34] LI L J, GAN Z, CHENG Y, et al.Relation-aware graph attention network for visual question answering[C]//Proceedings of 2019 IEEE/CVF International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2019:10312-10321.
[35] ZHU X, MAO Z D, CHEN Z N.Object-difference drived graph convolutional networks for visual question answering[J].Multimedia Tools and Applications, 2020, 12:1-19.
[36] YANG Z Q, QIN Z C, YU J, et al.Prior visual relationship reasoning for visual question answering[C]//Proceedings of 2020 IEEE International Conference on Image Processing.Washington D.C., USA:IEEE Press, 2020:1411-1415.
[37] CAO Q X, LIANG X D, WANG K Z, et al.Linguistically driven graph capsule network for visual question reasoning[EB/OL].[2021-02-10].http://arxiv.org/abs/2003.10065.
[38] HUANG D, CHEN P H, ZENG R H, et al.Location-aware graph convolutional networks for video question answering[EB/OL].[2021-02-10].http://arxiv.org/abs/2008.09105.
[39] KIPF T N, WELLING M.Semi-supervised classification with graph convolutional networks[C]//Proceedings of the 5th International Conference on Learning Representations.Toulon, France:[s.n.], 2017:1-14.
[40] NARASIMHAN M, LAZEBNIK S, SCHWING A G.Out of the box:reasoning with graph convolution nets for factual visual question answering[EB/OL].[2021-02-10].https://arxiv.org/pdf/1811.00538.pdf.
[41] YU D F.Attention mechanism and high-level semantics for visual question answering[D].Hefei:University of Science and Technology of China, 2019.(in Chinese)
[42] KRISHNA R, ZHU Y, GROTH O, et al.Visual genome:connecting language and vision using crowdsourced dense image annotations[J].International Journal of Computer Vision, 2017, 123(1):32-73.
[43] PENNINGTON J, SOCHER R, MANNING C D.GloVe:global vectors for word representation[C]//Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing.Doha, Qatar:Association for Computational Linguistics, 2014:1532-1543.
[44] BA J L, KIROS J R, HINTON G E.Layer normalization[EB/OL].[2021-02-10].http://arxiv.org/abs/1607.06450.
[45] GOYAL Y, KHOT T, AGRAWAL A, et al.Making the V in VQA matter:elevating the role of image understanding in visual question answering[J].International Journal of Computer Vision, 2019, 127(4):398-414.
[46] AGRAWAL A, BATRA D, PARIKH D, et al.Don't just assume; look and answer:overcoming priors for visual question answering[C]//Proceedings of 2018 IEEE Computer Society Conference on Computer Vision and Pattern Recognition.Salt Lake City, USA:IEEE Press, 2018:4971-4980.
[47] LIN T Y, MAIRE M, BELONGIE S, et al.Microsoft COCO:common objects in context[C]//Proceedings of European Conference on Computer Vision.Zurich, Switzerland:Springer, 2014:740-755.
[48] GOYAL P, DOLLAR P, GIRSHICK R, et al.Accurate, large minibatch SGD:training ImageNet in 1 hour[EB/OL].[2021-02-10].http://arxiv.org/abs/1706.02677.
[49] MALINOWSKI M, DOERSCH C, SANTORO A, et al.Learning visual question answering by bootstrapping hard attention[C]//Proceedings of European Conference on Computer Vision.Munich, Germany:Springer, 2018:3-20.