Semantic-driven Global–Local Hierarchical Alignment Medical Vision–Language Classification Model

doi:10.19678/j.issn.1000-3428.0260138

Abstract

Multimodal vision-language foundation models show considerable potential in medicine, but the complex semantic structure of medical data and the difficulty of modeling cross-modal relationships leave existing methods with two clear shortcomings. First, rigid patient-level alignment ignores semantic similarity between patients, so semantically similar samples are wrongly repelled as negatives, which degrades learning. Second, reports and images are not modeled under a unified multi-level semantic structure, which hinders fine-grained hierarchical cross-modal alignment. To address these issues, this paper proposes GLCA, a semantic-driven global-local hierarchical alignment model for medical vision-language classification. GLCA consists of two components: semantic-driven inter-patient soft global alignment and progressive three-granularity intra-patient local alignment. The soft global alignment uses cross-patient semantic sample-pair mining and a correlation-weighted contrastive penalty to build a more continuous feature space that better reflects true semantic relationships. The local alignment uses a progressive query fusion strategy to align visual and textual features at three levels: coarse (report-image), medium (sentence-region), and fine (word-patch), enabling interaction across both modalities and granularities. The two components are optimized jointly: the inter-patient soft global alignment first shapes the feature space, and the intra-patient local alignment then matches visual and textual features level by level, so that cross-modal semantics are embedded continuously and matched precisely. Extensive experiments on four chest X-ray datasets show that GLCA significantly outperforms existing methods on both zero-shot classification and few-shot fine-tuning classification. In zero-shot classification on the public 14-class ChestXray14 dataset, GLCA improves on the second-best method by 1.2%, 2.0%, and 2.2% in AUC, F1, and ACC, respectively.
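The idea of soft global alignment can be illustrated with a minimal NumPy sketch. This is not the authors' GLCA implementation. The function names (`soft_targets`, `soft_global_alignment_loss`), the temperatures `tau` and `tau_t`, and the choice to build soft targets from a softmax over report-report cosine similarity are all assumptions made for illustration. The sketch only shows how replacing one-hot patient-level targets with similarity-weighted targets stops semantically similar patients from being pushed apart as hard negatives.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length along `axis`."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def log_softmax(z, axis=-1):
    """Numerically stable log-softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=axis, keepdims=True))


def soft_targets(report_sem, tau_t=0.1):
    """Hypothetical soft targets: a row-wise softmax over the cosine
    similarity between the reports' semantic embeddings. Patients with
    similar reports share target mass instead of being pure negatives."""
    s = l2_normalize(report_sem)
    return np.exp(log_softmax(s @ s.T / tau_t, axis=1))


def soft_global_alignment_loss(img_emb, txt_emb, targets, tau=0.07):
    """Symmetric cross-entropy between image-text similarity logits and
    soft targets. With one-hot targets this is the usual InfoNCE loss."""
    v = l2_normalize(img_emb)
    t = l2_normalize(txt_emb)
    logits = v @ t.T / tau  # (batch, batch) image-to-text similarities
    loss_i2t = -(targets * log_softmax(logits, axis=1)).sum(axis=1).mean()
    loss_t2i = -(targets * log_softmax(logits.T, axis=1)).sum(axis=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(4, 8))  # stand-in image embeddings
    txt = rng.normal(size=(4, 8))  # stand-in report embeddings
    sem = rng.normal(size=(4, 8))  # stand-in report semantic embeddings
    sem[1] = sem[0]                # patients 0 and 1 have identical semantics
    targets = soft_targets(sem)
    print(targets.round(3))
    print(soft_global_alignment_loss(img, txt, targets))
```

In this toy batch, patients 0 and 1 have identical semantic embeddings, so each of them places equal target mass on the other. The loss therefore rewards pulling their image and text features together, whereas a rigid one-hot target would penalize it.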

Kedong Zhang, Xusheng Qian, Zhiyong Zhou, Yakang Dai. Semantic-driven Global–Local Hierarchical Alignment Medical Vision–Language Classification Model[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260138.

