[1] Cao M, Li S, Li J, et al. Image-text retrieval: A survey on recent research and development[J]. arXiv preprint arXiv:2203.14713, 2022.
[2] Zhang Zhenxing, Wang Yaxiong. A survey of research on image-text cross-modal retrieval[J]. Journal of Beijing Jiaotong University, 2024, 48(02): 23-36.
[3] Wang T, Li F, Zhu L, et al. Cross-modal retrieval: a systematic review of methods and future directions[J]. arXiv preprint arXiv:2308.14263, 2023.
[4] Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[5] Bin Y, Li H, Xu Y, et al. Unifying two-stream encoders with transformers for cross-modal retrieval[C]//Proceedings of the 31st ACM International Conference on Multimedia. 2023: 3041-3050.
[6] Wang G, Shang Y, Chen Y, et al. Scene graph based fusion network for image-text retrieval[C]//2023 IEEE International Conference on Multimedia and Expo (ICME). IEEE, 2023: 138-143.
[7] LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition[J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324.
[8] Liang J, Cao J, Sun G, et al. SwinIR: Image restoration using Swin Transformer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1833-1844.
[9] Gupta S, Mukherjee P, Chaudhury S, et al. DFTNet: Deep fish tracker with attention mechanism in unconstrained marine environments[J]. IEEE Transactions on Instrumentation and Measurement, 2021, 70: 1-13.
[10] Li Z, Zhang Z, Li M, et al. Dual Fine-Grained network with frequency Transformer for change detection on remote sensing images[J]. International Journal of Applied Earth Observation and Geoinformation, 2025, 136: 104393.
[11] Cheng M, Sun Y, Wang L, et al. ViSTA: Vision and scene text aggregation for cross-modal retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 5184-5193.
[12] Chen Y C, Li L, Yu L, et al. UNITER: Universal image-text representation learning[C]//European Conference on Computer Vision. Cham: Springer International Publishing, 2020: 104-120.
[13] Li J, Selvaraju R, Gotmare A, et al. Align before fuse: Vision and language representation learning with momentum distillation[J]. Advances in Neural Information Processing Systems, 2021, 34: 9694-9705.
[14] Li J, Li D, Xiong C, et al. BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//International Conference on Machine Learning. PMLR, 2022: 12888-12900.
[15] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[16] Akaho S. A kernel method for canonical correlation analysis[C]//International Meeting of Psychometric Society (IMPS2001). 2001.
[17] Faghri F, Fleet D J, Kiros J R, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives[C]//British Machine Vision Conference (BMVC). Newcastle, UK: BMVA Press, 2018: 1-12.
[18] Li K, Zhang Y, Li K, et al. Visual Semantic Reasoning for Image-Text Matching[C]//IEEE International Conference on Computer Vision (ICCV). Seoul, South Korea: IEEE, 2019: 4653-4661.
[19] Li K, Zhang Y, Li K, et al. Image-text embedding learning via visual and textual semantic reasoning[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 45(1): 641-656.
[20] Wang H, Zhang Y, Ji Z, et al. Consensus-Aware Visual-Semantic Embedding for Image-Text Matching[C]//Computer Vision – ECCV 2020. Cham: Springer International Publishing, 2020: 18-34.
[21] Wang F, Zhou Y, Wang S, et al. Multi-granularity cross-modal alignment for generalized medical visual representation learning[J]. Advances in Neural Information Processing Systems, 2022, 35: 33536-33549.
[22] Gao Dihui, Sheng Lijie, Xu Xiaodong, et al. Joint feature approach for image-text cross-modal retrieval[J]. Journal of Xidian University, 2024, 51(04): 128-138.
[23] Zhang K, Mao Z, Wang Q, et al. Negative-aware attention framework for image-text matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 15661-15670.
[24] Rao J, Ding L, Qi S, et al. Dynamic contrastive distillation for image-text retrieval[J]. IEEE Transactions on Multimedia, 2023, 25: 8383-8395.
[25] Huang H, Nie Z, Wang Z, et al. Cross-modal and uni-modal soft-label alignment for image-text retrieval[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2024, 38(16): 18298-18306.
[26] Lee K H, Chen X, Hua G, et al. Stacked Cross Attention for Image-Text Matching[C]//European Conference on Computer Vision (ECCV). Munich, Germany: Springer, 2018: 201-216.
[27] Chen H, Ding G, Liu X, et al. IMRAM: Iterative Matching With Recurrent Attention Memory for Cross-Modal Image-Text Retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Seattle, WA, USA: IEEE, 2020: 12652-12660.
[28] Liu Y, Liu H, Wang H, et al. BCAN: Bidirectional correct attention network for cross-modal retrieval[J]. IEEE Transactions on Neural Networks and Learning Systems, 2023, 35(10): 14247-14258.
[29] Qin X, Li L, Pang G, et al. Heterogeneous graph fusion network for cross-modal image-text retrieval[J]. Expert Systems with Applications, 2024, 249: 123842.
[30] Liang X, Yang E, Deng C, et al. CrossFormer: Cross-Modal Representation Learning via Heterogeneous Graph Transformer[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(12): Article 380. DOI: 10.1145/3688801.
[31] Cui W, Cheng R, Guo J, et al. MVAM: Multi-View Attention Method for Fine-Grained Image-Text Matching[C]//European Conference on Information Retrieval. Cham: Springer Nature Switzerland, 2025: 169-184.
[32] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[33] Li J, Li D, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[C]//International Conference on Machine Learning. PMLR, 2023: 19730-19742.
[34] Krishna R, Zhu Y, Groth O, et al. Visual Genome: Connecting Language and Vision Using Crowdsourced Dense Image Annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
[35] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[36] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019: 4171-4186.
[37] Zhao G, Zhang C, Shang H, et al. Generative label fused network for image–text matching[J]. Knowledge-Based Systems, 2023, 263: 110280.
[38] Cheng Y, Zhu X, Qian J, et al. Cross-modal graph matching network for image-text retrieval[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2022, 18(4): 1-23.
[39] Yang Yuxue, He Tian, Fan Jinghang, et al. Research on cross-modal image-text retrieval based on cross-attention and feature aggregation[J/OL]. Computer Engineering: 1-12 [2025-04-24].
[40] Zeng Guang, Peng Dezhong, Song Xiaomin, et al. Research on prompt-based natural language visual search[J]. Journal of Sichuan University (Natural Science Edition), 2025, 62(4): 857-863.
[41] Liang Yanpeng, Liu Xueer, Ma Zhonggui, et al. A causal image-text retrieval method embedded with consensus knowledge[J]. Chinese Journal of Engineering, 2024, 46(2): 317-328.