Fine-Grained Image-Text Matching via Adaptive Dynamic Aggregation

doi:10.19678/j.issn.1000-3428.0253393

Abstract

Abstract: Fine-grained image-text matching aligns visual and semantic fragments, such as image regions and sentence words, to achieve high-quality matching. Although existing work has made significant progress on region-word alignment, the text-word aggregation step still relies on strategies that adapt poorly to text length and to the semantic distribution of words. This causes semantic information to be lost and ultimately lowers overall matching accuracy. To address this problem, this study proposes a Lightweight Dynamic Aggregator (LDA), which consists of a tiny neural network and a Softmax function and dynamically generates the weights for sum and mean aggregation by analyzing text length and word semantic distribution. The LDA network first projects the input text features into a high-dimensional space, applies a nonlinear transformation to capture complex interactions, and then maps the result back to a low-dimensional space to compress the features. To prevent feature information from being lost during these transformations, the network uses residual connections to strengthen information flow, and a final Softmax normalization stabilizes the weights. Experimental results show that the proposed method outperforms existing advanced algorithms on public datasets. On the Flickr30K dataset, it achieves the best retrieval total score and the best results on all metrics in the text-to-image retrieval direction, with R@1 improving by 2.1%. On the 1K and 5K test sets of the MS-COCO dataset, it achieves the best retrieval total score and matches or exceeds existing results on every metric in both retrieval directions, while introducing only negligible additional computational overhead. This work confirms the importance of jointly optimizing for text length and semantic distribution in the aggregation stage, and it offers an efficient and robust aggregation approach for fine-grained image-text matching.
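The aggregation pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give layer sizes, the activation function, the exact form of the input text feature, or how the two weights are read out. The sketch assumes a ReLU activation, a residual connection around the up-projection and down-projection pair, and a hypothetical final linear head that produces two logits. The names `DynamicAggregatorSketch`, `text_feature`, and `word_scores` are illustrative, and the weights are random rather than trained.

```python
import math
import random


def linear(weights, bias, x):
    """Dense layer: `weights` holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a short list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases (illustrative, untrained)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


class DynamicAggregatorSketch:
    """Hypothetical stand-in for the LDA: tiny network followed by softmax."""

    def __init__(self, dim, hidden, seed=0):
        rng = random.Random(seed)
        self.up = init_layer(hidden, dim, rng)    # project to a higher dimension
        self.down = init_layer(dim, hidden, rng)  # map back to the input dimension
        self.head = init_layer(2, dim, rng)       # assumed head: logits for (sum, mean)

    def weights(self, text_feature):
        """Return (w_sum, w_mean); the two weights are positive and sum to 1."""
        hidden = [max(0.0, v) for v in linear(*self.up, text_feature)]  # assumed ReLU
        compressed = linear(*self.down, hidden)
        # Residual connection: add the input back so its information is kept.
        residual = [a + b for a, b in zip(text_feature, compressed)]
        return softmax(linear(*self.head, residual))

    def aggregate(self, text_feature, word_scores):
        """Blend sum and mean aggregation of per-word matching scores."""
        w_sum, w_mean = self.weights(text_feature)
        total = sum(word_scores)
        return w_sum * total + w_mean * total / len(word_scores)


if __name__ == "__main__":
    aggregator = DynamicAggregatorSketch(dim=4, hidden=16)
    feature = [0.2, -0.1, 0.4, 0.3]   # hypothetical pooled text feature
    scores = [0.5, 0.25, 0.75, 0.5]   # hypothetical per-word similarities
    print(aggregator.weights(feature), aggregator.aggregate(feature, scores))
```

Because the two weights come from a softmax, the blended score for non-negative word scores always lies between the mean and the sum. This is one way to read how such an aggregator could shift toward mean-like behaviour for long texts and sum-like behaviour for short ones once trained.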

Huang Tianyi, Zhang Cong, Liu Shiyi, Zuo Jiayi, Wang Zheng. Fine-Grained Image-Text Matching via Adaptive Dynamic Aggregation[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0253393.

