Medical Visual Question Answering Model Based on Context Awareness and Multi-level Feature Fusion

doi:10.19678/j.issn.1000-3428.0070013

Abstract:

Medical Visual Question Answering (Med-VQA) aims to predict an accurate answer from a given medical image and a related question. The task requires extracting question features and medical image features and fusing the two to obtain the final answer. Existing Med-VQA methods mainly focus on extracting and interacting global features; they fail to capture the correlation between the question and key image regions and lack the ability to understand fine-grained image information. To address this problem, this study proposes CAMF, a Med-VQA model based on context awareness and multi-level feature fusion, which attends closely to fine-grained image features and performs multi-level feature interaction. The model first enhances text and image features through two types of Guided Attention (GA), then uses a context awareness module to capture key fine-grained image features, and finally applies multi-level feature fusion so that the three kinds of features reinforce one another, yielding more effective features for answer prediction. Experimental results show that the overall accuracy of CAMF is 1.5 percentage points higher than that of the baseline model of the same type on the VQA-RAD dataset and 0.4 percentage points higher on the SLAKE dataset, and that CAMF performs comparably to medical-domain pre-training methods on both datasets. Feature map visualizations further show that the model focuses on key regions of the image and makes full use of image information to obtain answers.

Key words: Medical Visual Question Answering (Med-VQA), multi-level feature fusion, context awareness, Guided Attention (GA), multi-modality
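
The abstract names two types of Guided Attention (GA) but does not give their formulation. The sketch below illustrates one common reading, assumed here and not confirmed by the source: scaled dot-product cross-attention in which one modality supplies the queries and the other supplies the keys and values, applied once in each direction. The function names, the toy feature values, the residual connection, and the absence of learned projection matrices are all illustrative assumptions, so this should not be read as the CAMF implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def guided_attention(queries, context):
    """Let every query vector attend over the context vectors.

    Hypothetical helper: plain scaled dot-product attention with a
    residual connection and no learned weights (an assumption).
    """
    dim = len(queries[0])
    enhanced = []
    for q in queries:
        # Similarity of this query to every context vector.
        scores = [sum(a * b for a, b in zip(q, c)) / math.sqrt(dim)
                  for c in context]
        weights = softmax(scores)
        # Weighted sum of the context vectors.
        attended = [sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(dim)]
        # Residual connection keeps the original feature.
        enhanced.append([x + y for x, y in zip(q, attended)])
    return enhanced


# Toy features (made up for illustration): 2 question tokens, 3 image regions.
question = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Direction 1: image regions are guided by the question.
image_enhanced = guided_attention(image, question)
# Direction 2: question tokens are guided by the image.
question_enhanced = guided_attention(question, image)
```

Each call keeps the shape of its query input, so the enhanced image features still have one vector per region and the enhanced question features one vector per token. Under these assumptions, later stages such as a context awareness module or a fusion module could consume them unchanged.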

CHEN Jun, WU Xiaohong, CHEN Honggang, HE Xiaohai. Medical Visual Question Answering Model Based on Context Awareness and Multi-level Feature Fusion[J]. Computer Engineering, 2026, 52(6): 268-277.

https://www.ecice06.com/CN/Y2026/V52/I6/268

Fig.1 Model structure based on context awareness and multi-level feature fusion

Fig.2 Feature enhancement module

Fig.3 Context awareness module

Fig.4 Multi-level feature fusion module

Fig.5 Results of 10 repeated experiments on the VQA-RAD and SLAKE datasets

Fig.6 Impact of different values of k on model performance

Fig.7 Feature map visualization results of the model on the VQA-RAD dataset
