[1] Wang J, Xu J, Zhou Y, et al. MultiXpert: Dual-stream synergistic enhancement with cross-modal alignment for zero-shot chest x-ray diagnosis[J]. Information Processing & Management, 2026, 63(2): 104468.
[2] Ma Y S, Zhang G N, Liu Y T, et al. A survey on vision-language models[J]. Computer Technology and Development, 2026, 36(03): 1-10. (in Chinese)
[3] Liu M, Qi M J, Zhan Z Y, et al. A survey of image-text matching based on deep learning[J]. Chinese Journal of Computers, 2023, 46(11): 2370-2399. (in Chinese)
[4] Chen Z, Du Y, Hu J, et al. Mapping medical image-text to a joint space via masked modeling[J]. Medical Image Analysis, 2024, 91: 103018.
[5] Huang W, Li C, Zhou H-Y, et al. Enhancing representation in radiography-reports foundation model: A granular alignment algorithm using masked contrastive learning[J]. Nature Communications, 2024, 15(1): 7620.
[6] Xu L, Xie H, Wang F L, et al. Contrastive sentence representation learning with adaptive false negative cancellation[J]. Information Fusion, 2024, 102: 102065.
[7] An G C, Jiang B, Wang X L, et al. Multimodal semantic alignment based on extended image-text contrastive learning[J]. Computer Engineering, 2024, 50(11): 152-162. (in Chinese)
[8] Yang Z, Xu X, Zhang J, et al. Chest X-Ray Foundation Model With Global and Local Representations Integration[J]. IEEE Transactions on Medical Imaging, 2025, 44(12): 4787-4799.
[9] Zhao Z, Wang S, Gu J, et al. Chatcad+: Towards a universal and reliable interactive cad using llms[J]. IEEE Transactions on Medical Imaging, 2024, 43(11): 3755-3766.
[10] Zhang Y, Jiang H, Miura Y, et al. Contrastive learning of medical visual representations from paired images and text[C]// Machine Learning for Healthcare Conference. [S.l.]: PMLR, 2022: 2-25.
[11] Boecking B, Usuyama N, Bannur S, et al. Making the most of text semantics to improve biomedical vision–language processing[C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2022: 1-21.
[12] Wang Z, Wu Z, Agarwal D, et al. Medclip: Contrastive learning from unpaired medical images and text[C]// Proceedings of the Conference on Empirical Methods in Natural Language Processing. [S.l.]: ACL, 2022: 3876.
[13] Huynh T, Kornblith S, Walter M R, et al. Boosting contrastive self-supervised learning with false negative cancellation[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Los Alamitos, California: IEEE, 2022: 2785-2795.
[14] Liu B, Lu D, Wei D, et al. Improving medical vision-language contrastive pretraining with semantics-aware triage[J]. IEEE Transactions on Medical Imaging, 2023, 42(12): 3579-3589.
[15] Koleilat T, Asgariandehkordi H, Rivaz H, et al. Medclip-sam: Bridging text and image towards universal medical image segmentation[C]// Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2024: 643-653.
[16] Liu C, Cheng S, Shi M, et al. Imitate: Clinical prior guided hierarchical vision-language pre-training[J]. IEEE Transactions on Medical Imaging, 2024, 44(1): 519-529.
[17] Yu Y, Wang J, Liu W, et al. Multimodal multitask similarity learning for vision language model on radiological images and reports[J]. Neurocomputing, 2025, 636: 130018.
[18] Ni X, Wu L, Zhuang J, et al. MG-3D: Multi-Grained Knowledge-Enhanced Vision-Language Pre-training for 3D Medical Image Analysis[J]. Medical Image Analysis, 2026, 111: 104027.
[19] Huang S-C, Shen L, Lungren M P, et al. Gloria: A multimodal global-local representation learning framework for label-efficient medical image recognition[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, California: IEEE, 2021: 3942-3951.
[20] Cheng P, Lin L, Lyu J, et al. Prior: Prototype representation joint learning from medical images and reports[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, California: IEEE, 2023: 21361-21371.
[21] Wu C, Zhang X, Zhang Y, et al. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Los Alamitos, California: IEEE, 2023: 21372-21383.
[22] Lai H, Yao Q, Jiang Z, et al. Carzero: Cross-attention alignment for radiology zero-shot classification[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, California: IEEE, 2024: 11137-11146.
[23] Park J, Yoon B, Kim S, et al. RadZero: Similarity-Based Cross-Attention for Explainable Vision-Language Alignment in Chest X-ray with Zero-Shot Multi-Task Capability[C]// The Thirty-ninth Annual Conference on Neural Information Processing Systems. San Diego: Curran Associates, 2025.
[24] Ibrahimi S, Sun X, Wang P, et al. Audio-enhanced text-to-video retrieval using text-conditioned feature alignment[C]// Proceedings of the IEEE/CVF International Conference on Computer Vision. Los
[25] Liang X, Li X, Li F, et al. MedFILIP: Medical Fine-Grained Language-Image Pre-Training[J]. IEEE Journal of Biomedical and Health Informatics, 2025, 29(5): 3587-3597.
[26] Zhou Y, Zhang S, Wang X, et al. A medical report generation method based on local visual modeling and image-text co-enhancement[J]. Biomedical Signal Processing and Control, 2026, 112: 108527.
[27] Jiang H, Hao X, Huang Y, et al. Advancing Medical Radiograph Representation Learning: A Hybrid Pre-training Paradigm with Multilevel Semantic Granularity[C]// Proceedings of the European Conference on Computer Vision. Cham: Springer, 2024: 16-33.
[28] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]// International Conference on Machine Learning. [S.l.]: PMLR, 2021: 8748-8763.
[29] Zhuang J, Jing X-Y, Jia X. Mining negative samples on contrastive learning via curricular weighting strategy[J]. Information Sciences, 2024, 668: 120534.
[30] Radenovic F, Dubey A, Kadian A, et al. Filtering, distillation, and hard negatives for vision-language pre-training[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, California: IEEE, 2023: 6967-6977.
[31] Li Q, Yan X, Xu J, et al. Anatomical structure-guided medical vision-language pre-training[C]// Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2024: 80-90.
[32] Johnson A E, Pollard T J, Berkowitz S J, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports[J]. Scientific Data, 2019, 6(1): 317.
[33] Phan V M H, Xie Y, Qi Y, et al. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework[C]// Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Los Alamitos, California: IEEE, 2024: 11492-11501.
[34] Zou L, Li J, Chen H, et al. MCG-Net: Medical Chief Complaint-guided Multi-modal Masked Content Pre-training for chest image classification[J]. Expert Systems with Applications, 2025, 271: 126660.
[35] Wang X, Peng Y, Lu L, et al. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases[C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Los Alamitos, California: IEEE, 2017: 2097-2106.
[36] Irvin J, Rajpurkar P, Ko M, et al. Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison[C]// Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, California: AAAI Press, 2019: 590-597.
[37] Dosovitskiy A, Beyer L, Kolesnikov A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[EB/OL]. [2026-03-18]. https://arxiv.org/abs/2010.11929.
[38] Chen Z, Du Y, Hu J, et al. Multi-modal masked autoencoders for medical vision-and-language pre-training[C]// Proceedings of the International Conference on Medical Image Computing and Computer-Assisted Intervention. Cham: Springer, 2022: 679-689.
[39] Lee J, Yoon W, Kim S, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining[J]. Bioinformatics, 2020, 36(4): 1234-1240.