Research on Cross-Modal Image-Text Retrieval Based on Cross Attention and Feature Aggregation

Computer Engineering, 2026, Vol. 52, Issue (2): 311-321. doi: 10.19678/j.issn.1000-3428.0070119

YANG Yuxue¹, HE Tian¹, FAN Jinghang¹, LIU Ruiying¹, LI Teng²,*

1. Information and Communication Branch of State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050000, Hebei, China
2. Department of Computer, North China Electric Power University, Baoding 071003, Hebei, China

Received: 2024-07-15 Revised: 2024-09-02 Online: 2026-02-15 Published: 2025-03-11
Contact: LI Teng
About the authors:
YANG Yuxue, female, engineer, M.S.; her main research interests are cloud computing, big data technology, and energy big data research and management
HE Tian, engineer, M.S.
FAN Jinghang, engineer, M.S.
LIU Ruiying, engineer, M.S.
LI Teng (corresponding author), master's student
Funding:
Science and Technology Project of State Grid Hebei Electric Power Co., Ltd. (SGHEXT00YJJS2310459)

Abstract

Image-text retrieval has become an important research direction in the cross-modal field. However, existing methods for aggregating features from multiple modalities face two major challenges: insufficient feature alignment between modalities and loss of semantic representation within each modality. To address the representation of intra-modal feature information in cross-modal retrieval, a cross-modal image-text retrieval model based on cross attention and feature aggregation is proposed. The model comprises image and text feature extraction, cross-attention, feature pooling, and feature fusion modules, and uses a triplet loss function to mine local information in images and text, yielding image and text feature representations that capture deep semantic relationships. The model adopts an attention fusion strategy that regulates the fusion of fine-grained image and text features through learnable weight parameters. A feature pooling module is designed that aggregates image region features and text sequence features separately, learns the weight parameters through a neural network, and combines multiple similarities to jointly guide model learning. This module flexibly handles variable-length image and text feature sequences and enhances the model's ability to capture cross-modal information. Comparative experiments on the public datasets MS COCO and Flickr 30k show that the model achieves higher retrieval performance than a range of comparable image-text retrieval models. It has advantages in semantic feature pooling and dimensionality reduction, and offers a new approach to cross-modal feature fusion.

Key words: cross-modal retrieval, cross attention, image-text matching, feature pooling, feature fusion
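
The abstract names three building blocks: cross attention between image regions and words, pooling of variable-length feature sets with learned weights, and a triplet loss over image-text similarities. The following is a minimal pure-Python sketch of those ideas under stated assumptions, not the authors' implementation. All function names, the dot-product attention scores, the linear `scorer` vector standing in for the weight-learning network, the cosine similarity, the hardest-negative variant of the triplet loss, and the margin of 0.2 are illustrative choices that the abstract does not specify.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def softmax(scores):
    # Subtract the maximum for numerical stability.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, contexts):
    """For each query vector (e.g. a word), return an attention-weighted
    sum of the context vectors (e.g. image regions).
    Assumption: plain dot-product scores, no learned projections."""
    attended = []
    for q in queries:
        weights = softmax([dot(q, c) for c in contexts])
        attended.append([sum(w * c[d] for w, c in zip(weights, contexts))
                         for d in range(len(contexts[0]))])
    return attended

def weighted_pool(features, scorer):
    """Aggregate a variable-length list of feature vectors into one vector.
    `scorer` is a hypothetical weight vector standing in for the small
    network that would learn the pooling weights."""
    weights = softmax([dot(scorer, f) for f in features])
    return [sum(w * f[d] for w, f in zip(weights, features))
            for d in range(len(features[0]))]

def cosine(u, v):
    norm = math.sqrt(dot(u, u)) * math.sqrt(dot(v, v))
    return dot(u, v) / norm if norm else 0.0

def triplet_loss(sim, margin=0.2):
    """Hinge triplet loss using the hardest negative in each direction.
    sim[i][j] is the similarity of image i and text j; matched pairs
    sit on the diagonal. Requires at least two pairs."""
    n = len(sim)
    loss = 0.0
    for i in range(n):
        pos = sim[i][i]
        hardest_text = max(sim[i][j] for j in range(n) if j != i)
        hardest_image = max(sim[j][i] for j in range(n) if j != i)
        loss += max(0.0, margin + hardest_text - pos)
        loss += max(0.0, margin + hardest_image - pos)
    return loss / n

# Toy data: two images with different numbers of regions, two captions
# with different numbers of words (all values are made up).
regions = [[[1.0, 0.0], [0.8, 0.2], [0.9, 0.1]],
           [[0.0, 1.0], [0.1, 0.9]]]
words = [[[0.9, 0.1], [1.0, 0.0]],
         [[0.2, 0.8], [0.0, 1.0], [0.1, 0.9]]]
scorer = [1.0, -1.0]  # hypothetical pooling parameters

img_vecs = [weighted_pool(r, scorer) for r in regions]
txt_vecs = [weighted_pool(w, scorer) for w in words]
sim = [[cosine(i, t) for t in txt_vecs] for i in img_vecs]
loss = triplet_loss(sim)

# Words of caption 0 attending over the regions of image 0.
attended_words = cross_attend(words[0], regions[0])
```

Because the pooling weights come from a softmax over however many vectors are present, the same function handles three regions or two words without padding, which is one simple way to read the abstract's claim about variable-length sequences.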

YANG Yuxue, HE Tian, FAN Jinghang, LIU Ruiying, LI Teng. Research on Cross-Modal Image-Text Retrieval Based on Cross Attention and Feature Aggregation[J]. Computer Engineering, 2026, 52(2): 311-321.

https://www.ecice06.com/CN/Y2026/V52/I2/311

Figures/Tables 11

Fig.1 Structure of the proposed model

Fig.2 Schematic diagram of dynamic weight generation

Fig.3 Friedman test plot for different models

Fig.4 Image-text similarity heatmap

Fig.5 Example 1 of image-to-text retrieval

Fig.6 Example 2 of image-to-text retrieval

Fig.7 Example of text-to-image retrieval
