Computer Engineering

Fine-Grained Image-Text Matching via Adaptive Dynamic Aggregation

  • Online: 2026-03-24 Published: 2026-03-24

Abstract: Fine-grained image-text matching aligns visual-semantic fragments, such as regions in an image and words in a sentence, to achieve high-quality matching. Although existing studies have made significant progress on region-word alignment, the strategies used to aggregate over text words still adapt poorly to text length and to the semantic distribution of words, which causes semantic information to be lost and ultimately lowers overall matching accuracy. To address this problem, this study proposes a Lightweight Dynamic Aggregator (LDA), which consists of a small neural network followed by a Softmax function. By analyzing text length and the semantic distribution of words, the LDA dynamically generates the weights for sum and mean aggregation. The LDA network first projects the input text features into a higher-dimensional space, applies a nonlinear transformation to capture complex interactions, and then maps the result back to a low-dimensional space to compress the features. To prevent feature information from being lost during these transformations, the network uses a residual connection to strengthen information flow, and a final Softmax normalization stabilizes the weights. Experimental results show that the proposed method outperforms existing state-of-the-art methods on public datasets. On Flickr30K, it achieves the best overall retrieval score and the best results on all text-to-image retrieval metrics, with R@1 improving by 2.1%. On the 1K and 5K test sets of MS-COCO, it achieves the best overall retrieval score and comparable or superior performance on all metrics in both retrieval directions, while introducing only negligible additional computational overhead. This work not only verifies the importance of the joint optimization of text length and semantic distribution in the aggregation stage, but also offers an efficient and robust new aggregation approach for fine-grained image-text matching.
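
The aggregation idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the network's input, layer sizes, or activation, so the following choices are all assumptions made for illustration. The input is the mean-pooled word features with the log of the text length appended, the nonlinearity is ReLU, a final linear layer produces the two logits, and the weights are random and untrained. The class name `LightweightDynamicAggregator`, its methods, and the `word_scores` argument (one similarity score per word, for example the best region match for that word) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_matrix(rows, cols, rng, scale=0.1):
    """Small random weights; a trained model would learn these instead."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class LightweightDynamicAggregator:
    """Hypothetical sketch: blends sum and mean aggregation with learned weights."""

    def __init__(self, dim, hidden_dim, seed=0):
        rng = random.Random(seed)
        in_dim = dim + 1  # assumption: pooled features plus one length feature
        self.w_up = init_matrix(hidden_dim, in_dim, rng)    # project up
        self.w_down = init_matrix(in_dim, hidden_dim, rng)  # map back down
        self.w_out = init_matrix(2, in_dim, rng)            # two logits: sum, mean

    def weights(self, word_features):
        n_words = len(word_features)
        dim = len(word_features[0])
        # Assumption: summarize the sentence by mean-pooling its word features
        # and appending log(length) so the network can see the text length.
        pooled = [sum(f[j] for f in word_features) / n_words for j in range(dim)]
        x = pooled + [math.log(n_words)]
        hidden = [max(0.0, v) for v in matvec(self.w_up, x)]  # ReLU (assumed)
        compressed = matvec(self.w_down, hidden)
        residual = [c + xi for c, xi in zip(compressed, x)]   # residual connection
        return softmax(matvec(self.w_out, residual))          # [w_sum, w_mean]

    def aggregate(self, word_features, word_scores):
        w_sum, w_mean = self.weights(word_features)
        total = sum(word_scores)
        return w_sum * total + w_mean * total / len(word_scores)


if __name__ == "__main__":
    lda = LightweightDynamicAggregator(dim=4, hidden_dim=16)
    features = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.1, 0.0, 0.2], [0.3, 0.3, 0.3, 0.3]]
    scores = [0.2, 0.6, 0.4]
    print(lda.weights(features), lda.aggregate(features, scores))
```

Because the two weights come from a Softmax, the aggregated score always lies between the mean and the sum of the per-word scores. That is one way to read the abstract's claim that the normalization stabilizes the weights.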