[1] LOWE D G. Distinctive image features from scale-invariant keypoints[J]. International Journal of Computer Vision, 2004, 60(2): 91-110.
[2] ZHANG Y, JIN R, ZHOU Z H. Understanding bag-of-words model: a statistical framework[J]. International Journal of Machine Learning and Cybernetics, 2010, 1(1): 43-52.
[3] JELODAR H, WANG Y L, YUAN C, et al. Latent Dirichlet Allocation (LDA) and topic modeling: models, applications, a survey[J]. Multimedia Tools and Applications, 2019, 78(11): 15169-15211.
[4] HARDOON D R, SZEDMAK S, SHAWE-TAYLOR J. Canonical correlation analysis: an overview with application to learning methods[J]. Neural Computation, 2004, 16(12): 2639-2664.
[5] ZHENG W M, ZHOU X Y, ZOU C R, et al. Facial expression recognition using Kernel Canonical Correlation Analysis (KCCA)[J]. IEEE Transactions on Neural Networks, 2006, 17(1): 233-238.
[6] BENTON A, KHAYRALLAH H, GUJRAL B, et al. Deep generalized canonical correlation analysis[C]//Proceedings of the 4th Workshop on Representation Learning for NLP (RepL4NLP-2019). [S.l.]: ACL, 2019: 1-6.
[7] 高迪辉, 盛立杰, 许小冬, 等. 图文跨模态检索的联合特征方法[J]. 西安电子科技大学学报, 2024, 51(4): 128-138.
    GAO D H, SHENG L J, XU X D, et al. Joint feature approach for image-text cross-modal retrieval[J]. Journal of Xidian University, 2024, 51(4): 128-138. (in Chinese)
[8] LU J S, BATRA D, PARIKH D, et al. ViLBERT: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[EB/OL]. [2024-05-05]. https://arxiv.org/abs/1908.02265.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2103.00020.
[10] LI J N, LI D X, XIONG C M, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[EB/OL]. [2024-05-05]. https://arxiv.org/abs/2201.12086.
[11] LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the 40th International Conference on Machine Learning. New York, USA: ACM Press, 2023: 19730-19742.
[12] LEE K H, CHEN X, HUA G, et al. Stacked cross attention for image-text matching[EB/OL]. [2024-05-05]. https://arxiv.org/abs/1803.08024.
[13] ZHANG K, MAO Z D, WANG Q, et al. Negative-aware attention framework for image-text matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 15640-15649.
[14] YANG J Y, DUAN J L, TRAN S, et al. Vision-language pre-training with triple contrastive learning[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2022: 15650-15659.
[15] LIU Z, PEI X L, GAO S S, et al. Perceive, reason, and align: context-guided cross-modal correlation learning for image-text retrieval[J]. Applied Soft Computing, 2024, 154: 111395.
[16] KRISHNA R, ZHU Y K, GROTH O, et al. Visual Genome: connecting language and vision using crowdsourced dense image annotations[J]. International Journal of Computer Vision, 2017, 123(1): 32-73.
[17] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[18] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. [2024-05-05]. https://arxiv.org/abs/1810.04805.
[19] CHEN J C, HU H X, WU H, et al. Learning the best pooling strategy for visual semantic embedding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2021: 15784-15793.
[20] DELIÈGE A, ISTASSE M, KUMAR A, et al. Ordinal pooling[C]//Proceedings of the 30th British Machine Vision Conference. Cardiff, UK: BMVA Press, 2019: 76.
[21] KARPATHY A, LI F F. Deep visual-semantic alignments for generating image descriptions[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2015: 3128-3137.
[22] LI K P, ZHANG Y L, LI K, et al. Visual semantic reasoning for image-text matching[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV). Washington D.C., USA: IEEE Press, 2019: 4653-4661.
[23] MITHUN N C, PANDA R, PAPALEXAKIS E E, et al. Webly supervised joint embedding for cross-modal image-text retrieval[C]//Proceedings of the 26th ACM International Conference on Multimedia. New York, USA: ACM Press, 2018: 1856-1864.
[24] JI Z, CHEN K X, HE Y Q, et al. Heterogeneous memory enhanced graph reasoning network for cross-modal retrieval[J]. Science China Information Sciences, 2022, 65(7): 172104.
[25] ZHENG Z D, ZHENG L, GARRETT M, et al. Dual-path convolutional image-text embeddings with instance loss[J]. ACM Transactions on Multimedia Computing, Communications, and Applications, 2020, 16(2): 1-23.
[26] HUANG Y, WU Q, SONG C F, et al. Learning semantic concepts and order for image and sentence matching[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2018: 6163-6171.
[27] CHEN H, DING G G, LIU X D, et al. IMRAM: iterative matching with recurrent attention memory for cross-modal image-text retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D.C., USA: IEEE Press, 2020: 12652-12660.
[28] 杨晓宇, 李超, 陈舜尧, 等. 基于Transformer的图文跨模态检索算法[J]. 计算机科学, 2023, 50(4): 141-148.
    YANG X Y, LI C, CHEN S Y, et al. Text-image cross-modal retrieval based on Transformer[J]. Computer Science, 2023, 50(4): 141-148. (in Chinese)
[29] 梁彦鹏, 刘雪儿, 马忠贵, 等. 嵌入共识知识的因果图文检索方法[J]. 工程科学学报, 2024(2): 317-328.
    LIANG Y P, LIU X E, MA Z G, et al. Causal image-text retrieval embedded with consensus knowledge[J]. Chinese Journal of Engineering, 2024(2): 317-328. (in Chinese)
[30] 廖律超, 邹伟东, 杨佳龙, 等. 基于注意力机制和微分跟踪器的宽度学习系统[J]. 深圳大学学报(理工版), 2024, 41(5): 583-593.
    LIAO L C, ZOU W D, YANG J L, et al. Broad learning system based on attention mechanism and tracking differentiator[J]. Journal of Shenzhen University (Science and Engineering), 2024, 41(5): 583-593. (in Chinese)