
Computer Engineering

   

A Referring Camouflaged Object Detection Method Based on Text-Image Multimodal Fusion Network

  

  • Online: 2026-05-29 Published: 2026-05-29


Abstract: Referring camouflaged object detection (Ref-COD), a recently introduced task in the field of camouflaged object detection, aims to accurately segment a specified camouflaged object under the guidance of a reference image or text. Most existing methods use reference information from only a single modality and show clear limitations in fusing multi-source reference information and in adapting features across modalities, so the guidance that references can provide is not fully exploited. To address this, this paper proposes TIFNet, a Ref-COD network based on text-image multimodal fusion that makes efficient use of multi-source information for fine-grained detection. First, a Pyramid Vision Transformer (PVT) encoder, a frozen salient object detection (SOD) encoder, and a Contrastive Language-Image Pretraining (CLIP) encoder extract multi-stage features from the input image, the reference image, and the reference text, respectively. A Multi-Key-Value Reference Fusion Module (MRFM) then aligns and deeply fuses the cross-modal features, strengthening the targeted guidance of the reference information. A Reference Spatial-Channel Enhancement Module (RSCM) performs bidirectional mutual enhancement between the fused features and the reference features along two dimensions, resolving modality differences. Finally, a Reference Adaptive Normalization Module (RANM) focuses on key pixel-level details and improves the model's adaptability to diverse camouflage scenarios. Extensive experiments on the R2C7K dataset show that, compared with recent mainstream state-of-the-art (SOTA) methods, the proposed method achieves 0.869, 0.929, 0.786, and 0.022 on the four evaluation metrics, respectively. These results indicate a clear advantage and improved accuracy and robustness in segmenting specified camouflaged objects in complex scenes, and they offer a new approach to camouflaged object detection driven by multi-source information.
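
The abstract does not describe the internals of MRFM. As a rough illustration only, the sketch below assumes that "multi-key-value" fusion means the input-image features act as queries while the reference-image and reference-text features jointly supply the keys and values of a scaled dot-product attention. The function names (`softmax`, `attend`, `fuse_with_references`), the residual connection, and the absence of learned projections are all assumptions made for this example and are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        # Similarity of this query to every reference key.
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the reference values, one output dimension at a time.
        attended.append([
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ])
    return attended


def fuse_with_references(image_tokens, ref_image_tokens, ref_text_tokens):
    """Hypothetical fusion step: image tokens query a pooled reference set.

    Assumption: all tokens already share one embedding width. A real model
    would use learned query/key/value projections, omitted here.
    """
    references = ref_image_tokens + ref_text_tokens  # keys and values from both sources
    if not references:
        return [list(tok) for tok in image_tokens]
    attended = attend(image_tokens, references, references)
    # Residual connection (an assumption): keep the original image features.
    return [[x + a for x, a in zip(tok, att)]
            for tok, att in zip(image_tokens, attended)]


if __name__ == "__main__":
    image = [[1.0, 0.0], [0.0, 1.0]]   # two toy image tokens
    ref_image = [[0.5, 0.5]]           # one toy reference-image token
    ref_text = [[1.0, -1.0]]           # one toy reference-text token
    print(fuse_with_references(image, ref_image, ref_text))
```

Pooling both reference sources into one key-value set lets each image token weight the visual and textual cues by similarity. Whether TIFNet pools the references this way or attends to each modality separately is not stated in the abstract.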