
Computer Engineering ›› 2026, Vol. 52 ›› Issue (6): 268-277. doi: 10.19678/j.issn.1000-3428.0070013

• Multimodal and Information Fusion •

Medical Visual Question Answering Model Based on Context Awareness and Multi-level Feature Fusion

CHEN Jun, WU Xiaohong*, CHEN Honggang, HE Xiaohai

  1. School of Electronic Information, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2024-06-17 Revised: 2024-08-10 Published: 2026-06-15 Online: 2024-12-10
  • Contact: WU Xiaohong
  • About the authors:

    CHEN Jun (CCF student member), male, master's student; main research interest: natural language processing

    WU Xiaohong (corresponding author), associate professor

    CHEN Honggang, associate research fellow

    HE Xiaohai, professor

Abstract:

Medical Visual Question Answering (Med-VQA) aims to predict an accurate answer from a given medical image and a related question. The task requires extracting question features and medical image features and fusing the two to obtain the final answer. Existing Med-VQA methods mainly extract and interact features at the global level, so they cannot effectively capture the correlation between the question and key regions of the image, and they lack the ability to understand fine-grained image information. To address this problem, this study proposes CAMF, a Med-VQA model based on context awareness and multi-level feature fusion, which attends closely to fine-grained image features and performs multi-level feature interaction. The model first enhances the text and image features through two types of Guided Attention (GA), then uses a context awareness module to capture key fine-grained image features, and finally applies multi-level feature fusion so that the three types of features reinforce one another, yielding more effective features for answer prediction. Experimental results show that the overall accuracy of CAMF is 1.5 percentage points higher than that of the baseline model of the same type on the VQA-RAD dataset and 0.4 percentage points higher on the SLAKE dataset. On both datasets, it also achieves performance comparable to that of pre-training methods for the medical domain. In addition, feature map visualizations show that CAMF can effectively focus on key regions of the image and make full use of image information to obtain answers.

Key words: Medical Visual Question Answering (Med-VQA), multi-level feature fusion, context awareness, Guided Attention (GA), multi-modality
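
The abstract states that the model enhances text and image features through two types of Guided Attention but does not give the formulation. As a rough illustration only, the sketch below assumes that guided attention is scaled dot-product cross-attention in which one modality supplies the queries and the other supplies the keys and values, applied once in each direction. The function name `guided_attention`, the feature sizes, the residual connection, and the absence of learned projections are all assumptions made for this example; this is not the authors' implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def guided_attention(target, guide):
    """Hypothetical guided attention: `target` attends to `guide`.

    target: (n_t, d) features to be enhanced (queries).
    guide:  (n_g, d) features of the other modality (keys and values).
    Returns the enhanced target features and the attention weights.
    No learned projections are used; a real model would add them.
    """
    d = target.shape[-1]
    scores = target @ guide.T / np.sqrt(d)  # (n_t, n_g) similarity scores
    weights = softmax(scores, axis=-1)      # each row sums to 1
    attended = weights @ guide              # (n_t, d) guide summary per target row
    return target + attended, weights       # residual connection (assumed)


# Toy inputs: 12 question tokens and a 7x7 grid of image regions, 64-dim each.
rng = np.random.default_rng(0)
question = rng.normal(size=(12, 64))
image = rng.normal(size=(49, 64))

# Two directions, standing in for the "two types" mentioned in the abstract.
image_enh, w_img = guided_attention(image, question)     # question-guided image
question_enh, w_txt = guided_attention(question, image)  # image-guided question
```

Each row of `w_img` indicates how strongly one image region attends to each question token, which is the kind of question-to-region correlation the abstract says global-feature methods fail to capture.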