A Referring Camouflaged Object Detection Method Based on Text-Image Multimodal Fusion Network

doi:10.19678/j.issn.1000-3428.0260217

Abstract

Referring camouflaged object detection (Ref-COD), a new task in camouflaged object detection, aims to accurately segment a specified camouflaged object with the help of a reference image or reference text. Most existing methods use reference information from a single modality only, which limits multi-source reference fusion and cross-modal feature adaptation and keeps the references from providing their full guidance. To address this, this paper proposes TIFNet, a Ref-COD network based on text-image multimodal fusion, for efficient use of multi-source information and fine-grained detection. First, a Pyramid Vision Transformer (PVT) encoder, a frozen salient object detection (SOD) encoder, and a Contrastive Language-Image Pre-training (CLIP) encoder extract multi-stage features from the input image, the reference image, and the reference text, respectively. A Multi-Key-Value Reference Fusion Module (MRFM) then aligns and deeply fuses the cross-modal features, strengthening the targeted guidance of the reference information. A Reference Spatial-Channel Enhancement Module (RSCM) performs bidirectional mutual enhancement between the fused features and the reference features along the spatial and channel dimensions, resolving modality differences. Finally, a Reference Adaptive Normalization Module (RANM) focuses on key pixel-level details and improves the model's adaptability to diverse camouflage scenarios. Extensive experiments on the R2C7K dataset show that the method outperforms recent mainstream state-of-the-art (SOTA) methods, reaching 0.869, 0.929, 0.786, and 0.022 on the four evaluation metrics, respectively. These results indicate improved segmentation accuracy and robustness for specified camouflaged objects in complex scenes and offer a new approach to camouflaged object detection driven by multi-source information.
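The idea of fusing several reference sources through attention, as the abstract describes for MRFM, can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`attend`, `fuse`), the residual sum, and the toy token values are all assumptions made for illustration, and the sketch assumes every token has already been projected to a shared dimension. Image tokens act as queries, and each reference source supplies its own keys and values.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(queries, keys, values):
    """Scaled dot-product attention: each query gets a weighted mean of values."""
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out

def fuse(image_tokens, ref_image_tokens, ref_text_tokens):
    """Hypothetical fusion: add attention over both reference sources to the input."""
    from_image = attend(image_tokens, ref_image_tokens, ref_image_tokens)
    from_text = attend(image_tokens, ref_text_tokens, ref_text_tokens)
    return [[x + a + b for x, a, b in zip(tok, ai, bi)]
            for tok, ai, bi in zip(image_tokens, from_image, from_text)]

# Toy inputs (assumed values): two image tokens, two reference-image tokens,
# and one reference-text token, all of dimension 2.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
ref_image_tokens = [[1.0, 0.0], [1.0, 0.0]]
ref_text_tokens = [[0.0, 2.0]]
fused = fuse(image_tokens, ref_image_tokens, ref_text_tokens)
print(fused)
```

A trained network would add learned query, key, and value projections and would likely weight the two sources rather than sum them equally; the sketch only shows how one query set can draw on more than one key-value set.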

XU Han, YE Shan, DAI Qiuju, DING Yajun, WANG Runmin. A Referring Camouflaged Object Detection Method Based on Text-Image Multimodal Fusion Network[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260217.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260217

