
计算机工程 ›› 2022, Vol. 48 ›› Issue (2): 250-260. doi: 10.19678/j.issn.1000-3428.0061159

• Graphics and Image Processing •

Multi-Module Co-Attention Model for Visual Question Answering (面向视觉问答的多模块协同注意模型)

ZOU Pinrong (邹品荣)1, XIAO Feng (肖锋)2, ZHANG Wenjuan (张文娟)3, ZHANG Wanyu (张万玉)2, WANG Chenyang (王晨阳)2

  1. School of Armament Science and Technology, Xi'an Technological University, Xi'an 710021, China;
    2. School of Computer Science and Engineering, Xi'an Technological University, Xi'an 710021, China;
    3. School of Science, Xi'an Technological University, Xi'an 710021, China
  • Received: 2021-03-16  Revised: 2021-05-25  Published: 2021-05-27
  • About the authors: ZOU Pinrong (born 1997), male, M.S. candidate; his main research interests are deep learning and visual question answering. XIAO Feng (corresponding author), professor; ZHANG Wenjuan, associate professor; ZHANG Wanyu and WANG Chenyang, M.S. candidates.
  • Funding:
    National Natural Science Foundation of China (61572392, 62171361); Science and Technology Program of Shaanxi Province (2020GY-066); Natural Science Basic Research Program of Shaanxi Province (2021JM-440); Science and Technology Program of Weiyang District, Xi'an (201925).

Abstract: Visual Question Answering (VQA) is a typical multi-modal problem in computer vision and natural language processing. However, most existing VQA models ignore the dynamic relationships between the semantic information of the two modalities and the rich spatial structure among image regions. To address this, a novel multi-module co-attention network, MMCAN, is proposed to fully capture the dynamic interactions among objects in a visual scene together with the contextual representation of the question text. Relations between different types of objects are modeled with a graph attention mechanism, and a question-adaptive relation representation is learned. The question features and the relation-aware visual features are then encoded through co-attention to strengthen the dependence between question words and the corresponding image regions, and an attention enhancement module is used to improve the fitting ability of the model. Experimental results on the open datasets VQA 2.0 and VQA-CP v2 show that the accuracy of the proposed model on the "overall", "yes/no", "counting", and "other" question categories is significantly better than that of comparison methods such as DA-NTN, ReGAT, and ODA-GCN, indicating that the model effectively improves the accuracy of visual question answering.

Key words: Visual Question Answering (VQA), attention mechanism, graph attention network, relational reasoning, multimodal learning, feature fusion

CLC Number:
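
The abstract above outlines two mechanisms: graph attention that models relations between detected objects under the guidance of the question, and co-attention that aligns question words with image regions. The PyTorch code below is a minimal, hypothetical sketch of these two steps only; the class names, dimensions, mean-pooled question summary, and final fusion are illustrative assumptions and do not reproduce the authors' MMCAN implementation.

# Minimal sketch: question-guided graph attention over object regions,
# followed by co-attention between question words and relation-aware regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphAttention(nn.Module):
    """Single-head graph attention over K object regions, guided by the question."""
    def __init__(self, v_dim: int, q_dim: int, hid: int):
        super().__init__()
        self.w_v = nn.Linear(v_dim, hid)   # project region features
        self.w_q = nn.Linear(q_dim, hid)   # project the question summary
        self.att = nn.Linear(2 * hid, 1)   # edge-scoring function

    def forward(self, v, q):
        # v: (B, K, v_dim) region features; q: (B, q_dim) question summary
        h = self.w_v(v)                                        # (B, K, hid)
        qh = self.w_q(q).unsqueeze(1).expand_as(h)             # broadcast question to every node
        k = h.size(1)
        hi = (h * qh).unsqueeze(2).expand(-1, -1, k, -1)       # question-modulated node i, (B, K, K, hid)
        hj = h.unsqueeze(1).expand(-1, k, -1, -1)              # node j, (B, K, K, hid)
        e = self.att(torch.cat([hi, hj], dim=-1)).squeeze(-1)  # pairwise edge scores, (B, K, K)
        alpha = F.softmax(F.leaky_relu(e), dim=-1)             # attention over neighbours j
        return torch.bmm(alpha, h)                             # relation-aware region features, (B, K, hid)

class CoAttention(nn.Module):
    """Question words attend to regions and regions attend to words."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, w, r):
        # w: (B, T, dim) word features; r: (B, K, dim) relation-aware region features
        aff = torch.bmm(w, r.transpose(1, 2)) * self.scale           # (B, T, K) affinity matrix
        w_ctx = torch.bmm(F.softmax(aff, dim=2), r)                  # regions summarised per word
        r_ctx = torch.bmm(F.softmax(aff, dim=1).transpose(1, 2), w)  # words summarised per region
        return w_ctx, r_ctx

if __name__ == "__main__":
    B, K, T, D = 2, 36, 14, 512
    regions = torch.randn(B, K, 2048)    # e.g. pre-extracted object-detector features
    words = torch.randn(B, T, D)         # e.g. RNN-encoded question words
    q_summary = words.mean(dim=1)        # crude question summary, for the sketch only
    rel = GraphAttention(v_dim=2048, q_dim=D, hid=D)(regions, q_summary)
    w_ctx, r_ctx = CoAttention(D)(words, rel)
    fused = torch.cat([w_ctx.mean(1), r_ctx.mean(1)], dim=-1)  # simple fusion for an answer classifier
    print(fused.shape)                                         # torch.Size([2, 1024])

A full system would feed the fused vector to an answer classifier and would typically use multi-head attention, a learned question encoder, and an explicit attention enhancement module rather than the mean pooling used in this sketch.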