
Computer Engineering ›› 2025, Vol. 51 ›› Issue (6): 49-56. doi: 10.19678/j.issn.1000-3428.0068910

• Research Hotspots and Reviews •

Medical Visual Question Answering Based on Cross-Modal Attention Feature Enhancement

LIU Kai, REN Hongyi, LI Ying, JI Yi, LIU Chunping*

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China
  • Received: 2023-11-27 Online: 2025-06-15 Published: 2024-05-28
  • Contact: LIU Chunping

  • Supported by: National Natural Science Foundation of China (62376041); Postgraduate Research and Practice Innovation Program of Jiangsu Province (SJCX21_1341)

Abstract:

Medical Visual Question Answering (Med-VQA) requires understanding and combining the content of both medical images and question text. Designing effective modal representations and cross-modal fusion methods is therefore crucial to performance on Med-VQA tasks. Current Med-VQA methods typically attend only to the global features of medical images and the attention distribution within a single modality, ignoring the medical information contained in local image features as well as cross-modal interactions, which limits the understanding of image content. To address these problems, this study proposes a Cross-Modal Attention-Guided Med-VQA model (CMAG-MVQA). First, the method uses U-Net encoding to effectively enhance the local features of an image. Second, from the perspective of cross-modal collaboration, a selection-guided attention method is proposed to introduce interactive information from the other modality into each unimodal representation. In addition, a self-attention mechanism is used to further enhance the image representation produced by the selection-guided attention. Ablation and comparison experiments on the VQA-RAD medical question-answering dataset show that the proposed method performs well on Med-VQA tasks and improves feature representation compared with existing methods of the same kind.
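The pipeline described above (cross-modal selection-guided attention followed by self-attention enhancement of the image representation) can be sketched in PyTorch. This is a minimal illustration, not the authors' published implementation: the class name SelectionGuidedAttention, the residual/LayerNorm wiring, and the dimensions (d_model=512, 8 heads) are assumed for the example.

import torch
import torch.nn as nn


class SelectionGuidedAttention(nn.Module):
    """Sketch: enhance image features with interactive cues from the question.

    Queries come from the target modality (image regions); keys/values come
    from the guiding modality (question tokens), so each region attends to,
    and is re-weighted by, the question words most relevant to it.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # Cross-modal step: image regions select relevant question information.
        guided, _ = self.cross_attn(query=img_feats, key=txt_feats, value=txt_feats)
        img_feats = self.norm1(img_feats + guided)
        # Self-attention step: further enhance the guided image representation.
        enhanced, _ = self.self_attn(img_feats, img_feats, img_feats)
        return self.norm2(img_feats + enhanced)


if __name__ == "__main__":
    # 49 image-region tokens (e.g., from a U-Net-style encoder) and 20 question tokens.
    img = torch.randn(2, 49, 512)
    txt = torch.randn(2, 20, 512)
    out = SelectionGuidedAttention()(img, txt)
    print(out.shape)  # torch.Size([2, 49, 512])

In this sketch the cross-modal attention injects question information into each image region before self-attention refines the fused representation, mirroring the two enhancement stages named in the abstract.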

Key words: cross-modal interaction, attention mechanism, Medical Visual Question Answering (Med-VQA), feature fusion, feature enhancement
