
Computer Engineering ›› 2025, Vol. 51 ›› Issue (6): 49-56. doi: 10.19678/j.issn.1000-3428.0068910

• Research Hotspots and Reviews •

Medical Visual Question Answering Based on Cross-Modal Attention Feature Enhancement

LIU Kai, REN Hongyi, LI Ying, JI Yi, LIU Chunping*

  1. School of Computer Science and Technology, Soochow University, Suzhou 215006, Jiangsu, China
  • Received: 2023-11-27 Online: 2025-06-15 Published: 2024-05-28
  • Contact: LIU Chunping

  • Supported by: National Natural Science Foundation of China (62376041); Postgraduate Research and Practice Innovation Program of Jiangsu Province (SJCX21_1341)

Abstract:

Medical Visual Question Answering (Med-VQA) requires understanding and combining the content of both medical images and question text. Designing effective modal representations and cross-modal fusion methods is therefore crucial to performance on Med-VQA tasks. Current Med-VQA methods typically attend only to the global features of medical images and the attention distribution within a single modality, ignoring the medical information contained in local image features as well as cross-modal interactions, which limits the understanding of image content. To address these problems, this study proposes a Cross-Modal Attention-Guided Med-VQA model (CMAG-MVQA). First, the method uses U-Net encoding to effectively enhance the local features of an image. Second, from the perspective of cross-modal collaboration, a selection-guided attention method is proposed to introduce interactive information from the other modality into each unimodal representation. In addition, a self-attention mechanism is used to further enhance the image representation produced by the selection-guided attention. Ablation and comparison experiments on the VQA-RAD medical question-answering dataset show that the proposed method performs well on Med-VQA tasks and improves feature representation compared with existing methods of the same kind.
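The pipeline described above (cross-modal selection-guided attention followed by self-attention enhancement of the image representation) can be sketched in PyTorch. This is a minimal illustration, not the authors' published implementation: the class name SelectionGuidedAttention, the residual/LayerNorm wiring, and the dimensions (d_model=512, 8 heads) are assumed for the example.

import torch
import torch.nn as nn


class SelectionGuidedAttention(nn.Module):
    """Sketch: enhance image features with interactive cues from the question.

    Queries come from the target modality (image regions); keys/values come
    from the guiding modality (question tokens), so each region attends to,
    and is re-weighted by, the question words most relevant to it.
    """

    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, img_feats: torch.Tensor, txt_feats: torch.Tensor) -> torch.Tensor:
        # Cross-modal step: image regions select relevant question information.
        guided, _ = self.cross_attn(query=img_feats, key=txt_feats, value=txt_feats)
        img_feats = self.norm1(img_feats + guided)
        # Self-attention step: further enhance the guided image representation.
        enhanced, _ = self.self_attn(img_feats, img_feats, img_feats)
        return self.norm2(img_feats + enhanced)


if __name__ == "__main__":
    # 49 image-region tokens (e.g., from a U-Net-style encoder) and 20 question tokens.
    img = torch.randn(2, 49, 512)
    txt = torch.randn(2, 20, 512)
    out = SelectionGuidedAttention()(img, txt)
    print(out.shape)  # torch.Size([2, 49, 512])

In this sketch the cross-modal attention injects question information into each image region before self-attention refines the fused representation, mirroring the two enhancement stages named in the abstract.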

Key words: cross-modal interaction, attention mechanism, Medical Visual Question Answering (Med-VQA), feature fusion, feature enhancement
