
Computer Engineering



Image-Text Retrieval via Cross-Domain Feature Disentanglement and Semantic Prototype Guidance

  • Published: 2026-01-05


Abstract: Cross-modal image-text retrieval, a core task in multimodal understanding, faces the inherent heterogeneity of images and texts in modal expression, semantic abstraction level, and structural organization; achieving high-precision semantic alignment and bridging the cross-modal gap is therefore a key challenge in current research. To address this, this paper proposes DPNet, an image-text retrieval model based on cross-domain feature disentanglement and semantic prototype guidance, which aims to strengthen fine-grained image-text matching and retrieval robustness in complex scenarios. The model combines frequency-spatial joint disentanglement, hierarchical semantic enhancement, and a dual-modal interactive attention mechanism to structurally reconstruct cross-modal features and sharpen their discriminative expression. To overcome the difficulty traditional methods have in jointly modeling spatial structure and frequency-domain texture, the proposed frequency-spatial disentanglement module adopts a heterogeneous multi-head attention mechanism that preserves local spatial semantics while mining global periodic patterns, yielding a multi-dimensional, collaborative representation of visual features. To redress the imbalance between local word-level and global semantic alignment, the semantic enhancement module integrates part-of-speech tagging with depthwise separable convolution, guiding the model to focus on key semantic regions and improving its modeling of semantic patterns such as factual descriptions and subjective evaluations. In addition, to counter imbalanced training samples and noise sensitivity, the proposed dynamic-margin triplet loss adaptively adjusts the similarity decision boundary and, combined with semantic prototype contrastive learning, further improves intra-class compactness and inter-class separability. Experiments on the Flickr30K and MSCOCO benchmarks show that for fine-grained image-text retrieval on MSCOCO the proposed method improves R@1, R@5, and R@10 by 1.0%, 0.1%, and 0.2% and by 1.4%, 0.6%, and 0.3%, respectively, significantly outperforming existing mainstream methods. This work offers an efficient and practical route to high-precision, real-time retrieval in complex cross-modal scenarios.
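
To make the frequency-spatial disentanglement idea concrete, below is a minimal PyTorch sketch of heterogeneous attention heads: half attend over the raw patch tokens to keep local spatial semantics, and half attend over an FFT-magnitude view of the token sequence to capture global periodic patterns. The module name, the even head split, and the FFT-over-tokens choice are illustrative assumptions, not the paper's implementation.

```python
# Hypothetical sketch of a frequency-spatial disentanglement block with
# heterogeneous attention heads. Shapes and names are assumptions.
import torch
import torch.nn as nn

class FreqSpatialDecoupling(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads // 2, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads // 2, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) visual token sequence
        spatial, _ = self.spatial_attn(x, x, x)
        # FFT over the token axis; the magnitude keeps a real-valued
        # global-frequency view of the sequence.
        freq_view = torch.fft.fft(x, dim=1).abs()
        freq, _ = self.freq_attn(freq_view, freq_view, freq_view)
        # Concatenate both branches and project back to the model width.
        return self.fuse(torch.cat([spatial, freq], dim=-1))

feats = torch.randn(2, 49, 512)        # e.g. a 7x7 patch grid
out = FreqSpatialDecoupling()(feats)   # (2, 49, 512)
```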
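The semantic enhancement module can be pictured the same way: part-of-speech weights up-weight content words before a depthwise separable 1-D convolution mixes local context at low cost. The tag set, weight values, and module name below are hypothetical stand-ins for whatever the paper actually uses.

```python
# Hypothetical POS-guided semantic enhancement: content words are
# up-weighted, then a depthwise + pointwise Conv1d pair mixes context.
import torch
import torch.nn as nn

POS_WEIGHT = {"NOUN": 1.0, "VERB": 0.8, "ADJ": 0.8, "OTHER": 0.3}  # assumed values

class SemanticEnhancement(nn.Module):
    def __init__(self, dim: int = 512, kernel: int = 3):
        super().__init__()
        # Depthwise conv: one filter per channel; pointwise conv mixes channels.
        self.depthwise = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)
        self.pointwise = nn.Conv1d(dim, dim, 1)

    def forward(self, tokens: torch.Tensor, pos_tags: list[list[str]]) -> torch.Tensor:
        # tokens: (batch, seq_len, dim); pos_tags: coarse POS label per token
        w = torch.tensor(
            [[POS_WEIGHT.get(t, POS_WEIGHT["OTHER"]) for t in seq] for seq in pos_tags],
            device=tokens.device,
        ).unsqueeze(-1)                    # (batch, seq_len, 1) token weights
        x = (tokens * w).transpose(1, 2)   # (batch, dim, seq_len) for Conv1d
        x = self.pointwise(self.depthwise(x))
        return x.transpose(1, 2)

tok = torch.randn(1, 4, 512)
tags = [["DET", "ADJ", "NOUN", "VERB"]]    # "DET" falls back to OTHER
out = SemanticEnhancement()(tok, tags)     # (1, 4, 512)
```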
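Finally, a hedged sketch of the two training objectives. The margin-adaptation rule below (widening the margin when the positive pair barely beats the hardest in-batch negative) is one plausible reading of "dynamic boundary"; the abstract does not specify its exact form, and the prototype term is written as a standard prototype-classification contrastive loss.

```python
# Hypothetical dynamic-margin triplet loss plus semantic-prototype
# contrastive term. The adaptation rule and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def dynamic_margin_triplet(img, txt, base_margin=0.2, scale=0.1):
    # img, txt: (batch, dim) L2-normalized embeddings; matching pairs share an index.
    sim = img @ txt.t()                           # (batch, batch) cosine similarities
    pos = sim.diag().unsqueeze(1)                 # positive-pair similarity
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    neg = sim.masked_fill(mask, -1.0)             # exclude positives from negatives
    hardest = neg.max(dim=1, keepdim=True).values
    # Margin widens when the positive barely beats the hardest negative.
    margin = base_margin + scale * (1.0 - (pos - hardest)).clamp(min=0.0)
    return F.relu(margin - pos + hardest).mean()

def prototype_contrastive(emb, labels, prototypes, tau=0.07):
    # emb: (batch, dim); prototypes: (num_classes, dim); labels: (batch,)
    logits = emb @ prototypes.t() / tau           # similarity to each class prototype
    return F.cross_entropy(logits, labels)        # pulls embeddings toward their prototype

img = F.normalize(torch.randn(8, 512), dim=-1)
txt = F.normalize(torch.randn(8, 512), dim=-1)
loss = dynamic_margin_triplet(img, txt)
```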