
Computer Engineering, 2026, Vol. 52, Issue (2): 311-321. doi: 10.19678/j.issn.1000-3428.0070119

• Multimodal Information Fusion •

Research on Cross-Modal Image-Text Retrieval Based on Cross Attention and Feature Aggregation

YANG Yuxue1, HE Tian1, FAN Jinghang1, LIU Ruiying1, LI Teng2   

  1. Information and Communication Branch of State Grid Hebei Electric Power Co., Ltd., Shijiazhuang 050000, Hebei, China;
    2. Department of Computer Science, North China Electric Power University, Baoding 071003, Hebei, China
  • Received: 2024-07-15  Revised: 2024-09-02  Published: 2025-03-11

  • About the authors: YANG Yuxue, female, engineer, M.S.; her main research interests are cloud computing, big data technology, and energy big data research and management. HE Tian, FAN Jinghang, and LIU Ruiying are engineers with master's degrees. LI Teng (corresponding author) is a master's student. E-mail: gfdr5@ncepu.edu.cn
  • Funding:
    Science and Technology Project of State Grid Hebei Electric Power Co., Ltd. (SGHEXT00YJJS2310459).

Abstract: Image-text retrieval has become an important research direction in the cross-modal field. However, existing methods that aggregate features from multiple modalities face two major challenges: insufficient feature alignment between modalities and loss of semantic representation within modalities. To address the representation of intra-modal feature information, a cross-modal image-text retrieval model based on cross attention and feature aggregation is proposed. The model comprises image and text feature extraction, cross attention, feature pooling, and feature fusion modules, and it uses a triplet loss function to mine local information in images and text, yielding image and text feature representations with deep semantic relationships. The model adopts an attention fusion strategy that regulates the fusion of fine-grained image and text features through learnable weight parameters. A feature pooling module is designed that aggregates image region features and text sequence features separately, learns weight parameters through neural networks, and combines multiple similarities to jointly guide model training. This module flexibly handles the variable-length sequence features of images and text, enhancing the model's ability to capture cross-modal information. Comparative experiments on the public MS COCO and Flickr30k datasets show that the proposed model achieves higher retrieval performance than a variety of image-text retrieval models. It offers advantages in semantic feature pooling and dimensionality reduction, providing a new approach to cross-modal feature fusion.

Key words: cross-modal retrieval, cross attention, image-text matching, feature pooling, feature fusion
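As a concrete illustration of the mechanisms named in the abstract, the following is a minimal PyTorch sketch of cross attention between image region features and text token features, fusion of the attended and original features through a learnable weight, learned-weight pooling over variable-length sequences, and a triplet ranking loss with hard negatives. All module names, dimensions, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch (not the authors' code) of the abstract's core ideas:
#   (1) cross attention between image regions and text tokens,
#   (2) fusion of attended and original features via a learnable weight,
#   (3) learned-weight pooling over variable-length sequences,
#   (4) a triplet ranking loss on the pooled embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttentionFusion(nn.Module):
    """Attend one modality over the other, then fuse with a learnable weight."""
    def __init__(self, dim: int):
        super().__init__()
        self.scale = dim ** -0.5
        self.alpha = nn.Parameter(torch.tensor(0.5))  # learnable fusion weight

    def forward(self, query, context):
        # query:   (B, Nq, D), e.g. image region features
        # context: (B, Nc, D), e.g. text token features
        attn = torch.softmax(query @ context.transpose(1, 2) * self.scale, dim=-1)
        attended = attn @ context                     # (B, Nq, D)
        return self.alpha * attended + (1 - self.alpha) * query

class WeightedPool(nn.Module):
    """Aggregate a variable-length feature sequence with learned weights."""
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):                         # feats: (B, N, D)
        w = torch.softmax(self.score(feats), dim=1)   # (B, N, 1) per-position weights
        return (w * feats).sum(dim=1)                 # (B, D) pooled embedding

def triplet_loss(img_emb, txt_emb, margin: float = 0.2):
    """Hinge-based triplet ranking loss with hardest in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    sim = img_emb @ txt_emb.t()                       # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # similarities of matching pairs
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, -1.0)                 # exclude matching pairs
    loss_i2t = F.relu(margin + neg.max(dim=1).values.unsqueeze(1) - pos)
    loss_t2i = F.relu(margin + neg.max(dim=0).values.unsqueeze(1) - pos)
    return (loss_i2t + loss_t2i).mean()

# Toy usage with random features (all dimensions are arbitrary assumptions).
B, Nr, Nt, D = 4, 36, 20, 256                         # batch, regions, tokens, dim
img = torch.randn(B, Nr, D)                           # image region features
txt = torch.randn(B, Nt, D)                           # text token features
fuse = CrossAttentionFusion(D)
pool_i, pool_t = WeightedPool(D), WeightedPool(D)
img_emb = pool_i(fuse(img, txt))                      # text-attended image embedding
txt_emb = pool_t(fuse(txt, img))                      # image-attended text embedding
print(triplet_loss(img_emb, txt_emb))
```

The learnable scalar `alpha` stands in for the abstract's attention fusion strategy, and `WeightedPool` stands in for the feature pooling module; the paper may use richer parameterizations for both.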
