[1] Zhang Zhenxing, Wang Yaxiong. A Survey of Image-Text Cross-Modal Retrieval Research[J]. Journal of Beijing Jiaotong University, 2024, 48(02): 23-36. (in Chinese)
[2] Lowe D G. Distinctive Image Features from Scale-Invariant Keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[3] Zhang Y, Jin R, Zhou Z H. Understanding bag-of-words model: a statistical framework[J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1): 43-52.
[4] Hardoon D R, Szedmak S, Shawe-Taylor J. Canonical correlation analysis: An overview with application to learning methods[J]. Neural computation, 2004, 16(12): 2639-2664.
[5] Zheng W, Zhou X, Zou C, et al. Facial expression recognition using kernel canonical correlation analysis (KCCA)[J]. IEEE Transactions on Neural Networks, 2006, 17(1): 233-238.
[6] Faghri F, Fleet D J, Kiros J R, et al. VSE++: Improving Visual-Semantic Embeddings with Hard Negatives[C]//Proceedings of the British Machine Vision Conference (BMVC). Newcastle, UK: BMVA Press, 2018: 1-14.
[7] Li Z, Guo C, Wang X, et al. Selectively hard negative mining for alleviating gradient vanishing in image-text matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2025, 35(2): 1921-1935.
[8] Li Z, Lu H, Fu H, et al. Image-text bidirectional learning network based cross-modal retrieval[J]. Neurocomputing, 2022, 483: 148-159.
[9] Zhang Y, Ji Z, Wang D, et al. USER: Unified semantic enhancement with momentum contrast for image-text retrieval[J]. IEEE Transactions on Image Processing, 2024, 33: 595-609.
[10] Pham K, Huynh C, Lim S N, et al. Composing object relations and attributes for image-text matching[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2024: 14354-14363.
[11] Lee K H, Chen X, Hua G, et al. Stacked cross attention for image-text matching[C]//Proceedings of the European conference on computer vision (ECCV). Cham, Switzerland: Springer, 2018: 201-216.
[12] Li K, Zhang Y, Li K, et al. Visual semantic reasoning for image-text matching[C]//Proceedings of the IEEE/CVF international conference on computer vision. Piscataway, USA: IEEE, 2019: 4654-4662.
[13] Pan Z, Wu F, Zhang B. Fine-grained image-text matching by cross-modal hard aligning network[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2023: 19275-19284.
[14] Messina N, Amato G, Esuli A, et al. Fine-grained visual textual alignment for cross-modal retrieval using transformer encoders[J]. ACM Transactions on Multimedia Computing, Communications, and Applications (TOMM), 2021, 17(4): 1-23.
[15] Diao H, Zhang Y, Ma L, et al. Similarity reasoning and filtration for image-text matching[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto, USA: AAAI Press, 2021, 35(2): 1218-1226.
[16] Zhang K, Mao Z, Wang Q, et al. Negative-aware attention framework for image-text matching[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2022: 15661-15670.
[17] Yang Yuxue, He Tian, Fan Jinghang, et al. Research on Cross-Modal Image-Text Retrieval Based on Cross-Attention and Feature Aggregation[J/OL]. Computer Engineering, 1-12[2025-10-30]. https://doi.org/10.19678/j.issn.1000-3428.0070119. (in Chinese)
[18] Li M, Gao Y, Zhao H, et al. Progressive semantic aggregation and structured cognitive enhancement for image–text matching[J]. Expert Systems with Applications, 2025, 274: 126943.
[19] Wang P, Zhang L, Mao Z, et al. Matryoshka Learning with Metric Transfer for Image-text Matching[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2025, 35(9): 9502-9516.
[20] Krishna R, Zhu Y, Groth O, et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations[J]. International journal of computer vision, 2017, 123(1): 32-73.
[21] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[J]. IEEE transactions on pattern analysis and machine intelligence, 2016, 39(6): 1137-1149.
[22] Anderson P, He X, Buehler C, et al. Bottom-up and top-down attention for image captioning and visual question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2018: 6077-6086.
[23] Yu Chao, Wang Mingshuo, Zhao Ziqiao, et al. Image-Text Matching Based on Relative Position of Images and Negative Perception[J]. Modern Electronics Technique, 2024, 47(17): 88-93. DOI:10.16652/j.issn.1004-373x.2024.17.014. (in Chinese)
[24] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). Stroudsburg, USA: Association for Computational Linguistics, 2019: 4171-4186.
[25] Chen J, Hu H, Wu H, et al. Learning the best pooling strategy for visual semantic embedding[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2021: 15789-15798.
[26] Young P, Lai A, Hodosh M, et al. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions[J]. Transactions of the association for computational linguistics, 2014, 2: 67-78.
[27] Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]//European conference on computer vision. Cham, Switzerland: Springer International Publishing, 2014: 740-755.
[28] Wei X, Zhang T, Li Y, et al. Multi-modality cross attention network for image and sentence matching[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Piscataway, USA: IEEE, 2020: 10941-10950.
[29] Zhang H, Mao Z, Zhang K, et al. Show your faith: Cross-modal confidence-aware network for image-text matching[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto, USA: AAAI Press, 2022, 36(3): 3262-3270.
[30] Li K, Zhang Y, Li K, et al. Image-text embedding learning via visual and textual semantic reasoning[J]. IEEE transactions on pattern analysis and machine intelligence, 2022, 45(1): 641-656.
[31] Zhu H, Zhang C, Wei Y, et al. ESA: External space attention aggregation for image-text retrieval[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2023, 33(10): 6131-6143.
[32] Radenović F, Tolias G, Chum O. Fine-tuning CNN image retrieval with no human annotation[J]. IEEE transactions on pattern analysis and machine intelligence, 2018, 41(7): 1655-1668.