
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 96-104. doi: 10.19678/j.issn.1000-3428.0062339

• Artificial Intelligence and Pattern Recognition •

  • About the authors: FU Pengcheng (born 1993), male, M.S. candidate; his research interests include machine learning and image processing. YANG Guan (corresponding author), associate professor, Ph.D.; LIU Xiaoming and LIU Yang, lecturers, Ph.D.; ZHANG Ziming and CHENG Xi, M.S. candidates.
  • Funding: National Natural Science Foundation of China (61772576, 61906141); Natural Science Foundation of Shaanxi Province (2020JQ-317); Science and Technology Research Program of Henan Province (182102210126).

Visual Question Answering Model Based on Spatial Relation and Frequency Feature

FU Pengcheng1,2, YANG Guan1,2, LIU Xiaoming1,2, LIU Yang3, ZHANG Ziming1,2, CHENG Xi1,2   

  1. School of Computer Science, Zhongyuan University of Technology, Zhengzhou 450007, China;
    2. Henan Key Laboratory on Public Opinion Intelligent Analysis, Zhengzhou 450007, China;
    3. School of Telecommunications Engineering, Xidian University, Xi'an 710071, China
  • Received:2021-08-12 Revised:2021-10-04 Published:2021-10-19



Abstract: As an important task in multimodal data processing, Visual Question Answering (VQA) must associate and represent information from different modalities. However, existing VQA models cannot effectively distinguish similar target objects and cannot accurately express the spatial relationships between them, which degrades overall model performance. To fully exploit the fine-grained and spatial-relationship information in VQA images and questions, this study combines spatial-domain and frequency-domain features with the Bottom-Up and Top-Down attention (BUTD) model and the Modular Co-Attention Network (MCAN) model to construct a multi-dimensional enhanced attention model, BUDR, and a modular co-enhanced attention network model, MCDR. The BUDR and MCDR models use the Discrete Cosine Transform (DCT) to obtain frequency information, mitigating the loss of image detail, and a Relation Network (RN) to learn spatial structure and latent relational information, reducing the misalignment of image and question features and enhancing the models' reasoning capability. Experimental results on the VQA v2.0 dataset and the test-dev validation set show that the BUDR and MCDR models enhance fine-grained image recognition and improve the correlation between the image and the target objects of the question; compared with the BUTD and MCAN models, their prediction accuracy increases by 0.14 and 0.25 percentage points, respectively.
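The two ingredients named in the abstract, DCT-based frequency features and Relation-Network-style pairwise aggregation over region features, can be sketched in isolation. The snippet below is a minimal illustration, not the paper's implementation: the region features are random, the shapes are assumptions, and a fixed linear map stands in for the RN's learned MLP.

```python
import numpy as np
from scipy.fft import dct

# Hypothetical region features: 36 detected regions, 64-dim each
# (the paper's actual dimensions and fusion details are not given here).
rng = np.random.default_rng(0)
regions = rng.standard_normal((36, 64))

def frequency_features(x, k=16):
    """Type-II DCT along the feature axis; keep the k lowest-frequency
    coefficients as a compact frequency-domain descriptor."""
    return dct(x, type=2, axis=-1, norm="ortho")[..., :k]

def relation_features(x):
    """Relation-Network-style aggregation: apply a shared function g to
    every ordered pair of regions and sum the results into one vector."""
    n, d = x.shape
    # Stand-in for g: a fixed random linear map + ReLU (an RN would learn an MLP).
    W = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    pairs = np.concatenate(
        [np.repeat(x, n, axis=0),      # each region repeated n times
         np.tile(x, (n, 1))], axis=1)  # all regions cycled -> (n*n, 2d) pairs
    return np.maximum(pairs @ W, 0).reshape(n, n, d).sum(axis=(0, 1))

freq = frequency_features(regions)                 # (36, 16) per-region frequency codes
rel = relation_features(regions)                   # (64,) pooled relational vector
fused = np.concatenate([freq.mean(axis=0), rel])   # naive fusion sketch, (80,)
print(freq.shape, rel.shape, fused.shape)
```

In BUDR/MCDR these two streams would feed the BUTD or MCAN attention stack rather than a simple concatenation; the sketch only shows where frequency and relational information come from.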

Key words: Discrete Cosine Transform(DCT), fine-grained identification, Relation Network(RN), attention mechanism, feature fusion

CLC Number: