
Computer Engineering (计算机工程)



Research on Weakly Supervised Semantic Segmentation Methods Based on Multi-modal Contrastive Learning

  • Published: 2025-11-13

Abstract: This study addresses key challenges in weakly supervised semantic segmentation (WSSS) based on contrastive language-image pre-training (CLIP): inadequate fine-grained image-text semantic alignment, limited perception of local detail in the text context, and insufficient local detail together with noise propagation in the generated pseudo-labels. To tackle these issues, we propose the Feature Fusion Contrastive Learning framework (FFCLIP), which uses a frozen CLIP model as its backbone and integrates three new modules, Panoramic Perception Attention (PPA), the Rectangular Calibration Module (RCM), and Weighted Cross-modal Fusion (WFF), to strengthen cross-modal semantic alignment, refine local boundary perception, and improve pseudo-label quality. The resulting multi-stage WSSS training framework achieves mIoU scores of 76.9% and 77.5% on the VOC2012 validation and test sets, respectively, surpassing the mainstream CTI method by 2.8% and 4.3%, and reaches 47.1% mIoU on COCO2014, significantly outperforming baselines such as CPAL. Experiments show that FFCLIP substantially improves segmentation accuracy under weak supervision while keeping computational overhead low (6M additional parameters and a peak GPU memory footprint of 6.2 GB), offering a new direction for combining multi-modal learning with weakly supervised segmentation. Code link: https://github.com/xuwudang/FFCLIP
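
The repository linked above is the authoritative implementation; the following is only a minimal PyTorch sketch of how the three modules named in the abstract (PPA, RCM, WFF) could slot into a frozen-CLIP pseudo-label pipeline. The module internals, the tensor shapes, the `generate_pseudo_labels` helper, and the 0.4 confidence threshold are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumptions, not the authors' implementation) of how the three
# modules named in the abstract could combine CLIP image and text features into
# pseudo-labels. Real CLIP encoders are replaced by pre-computed token tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PanoramicPerceptionAttention(nn.Module):
    """PPA (assumed form): cross-attention from image patch tokens to class-text
    tokens, aimed at fine-grained image-text semantic alignment."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N, D) patch features; txt_tokens: (B, C, D) class-text features
        attended, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + attended)


class RectangularCalibrationModule(nn.Module):
    """RCM (assumed form): horizontal/vertical strip pooling that re-weights the
    spatial feature map to sharpen local boundary responses."""

    def __init__(self, dim: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (B, D, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (B, D, 1, W)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feat):
        # feat: (B, D, H, W) spatial feature map
        gate = torch.sigmoid(self.proj(self.pool_h(feat) + self.pool_w(feat)))
        return feat * gate


class WeightedCrossModalFusion(nn.Module):
    """WFF (assumed form): learnable convex combination of the aligned and
    calibrated feature streams."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, aligned, calibrated):
        a = torch.sigmoid(self.alpha)
        return a * aligned + (1.0 - a) * calibrated


def generate_pseudo_labels(img_tokens, txt_embeds, hw, ppa, rcm, wff, threshold=0.4):
    """Toy pseudo-label generation: fuse the two streams, score every patch against
    every class-text embedding, and keep only confident predictions."""
    B, N, D = img_tokens.shape
    H, W = hw
    aligned = ppa(img_tokens, txt_embeds)                          # (B, N, D)
    spatial = img_tokens.transpose(1, 2).reshape(B, D, H, W)
    calibrated = rcm(spatial).flatten(2).transpose(1, 2)           # (B, N, D)
    fused = wff(aligned, calibrated)                               # (B, N, D)
    cams = torch.einsum("bnd,bcd->bcn",
                        F.normalize(fused, dim=-1),
                        F.normalize(txt_embeds, dim=-1))           # (B, C, N)
    cams = cams.reshape(B, -1, H, W)
    labels = cams.argmax(dim=1)                                    # (B, H, W)
    labels[cams.amax(dim=1) < threshold] = 255                     # ignore uncertain pixels
    return labels


if __name__ == "__main__":
    # Dummy shapes: 2 images, 14x14 patch grid, 512-dim CLIP-like features, 20 classes.
    B, H, W, D, C = 2, 14, 14, 512, 20
    img_tokens = torch.randn(B, H * W, D)
    txt_embeds = torch.randn(B, C, D)
    ppa = PanoramicPerceptionAttention(D)
    rcm = RectangularCalibrationModule(D)
    wff = WeightedCrossModalFusion()
    print(generate_pseudo_labels(img_tokens, txt_embeds, (H, W), ppa, rcm, wff).shape)  # torch.Size([2, 14, 14])
```

In typical multi-stage WSSS pipelines, the thresholded activation maps would be further refined before training the final segmentation network; the linked repository contains the authors' actual procedure.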