
Computer Engineering (计算机工程)



Research on Weakly Supervised Semantic Segmentation Methods Based on Multi-modal Contrastive Learning

  • Published: 2025-11-13

Abstract: This study addresses key challenges in weakly supervised semantic segmentation (WSSS) based on contrastive language-image pre-training (CLIP): inadequate fine-grained image-text semantic alignment, limited perception of local detail in the text context, and insufficient local detail together with noise propagation in the generated pseudo-labels. To tackle these issues, we propose the Feature Fusion Contrastive Learning framework (FFCLIP), which uses a frozen CLIP model as its backbone and integrates three new modules, Panoramic Perception Attention (PPA), the Rectangular Calibration Module (RCM), and Weighted Cross-modal Fusion (WFF), to strengthen cross-modal semantic alignment, refine local boundary perception, and improve pseudo-label quality. The resulting multi-stage WSSS training framework achieves mIoU scores of 76.9% and 77.5% on the VOC2012 validation and test sets, respectively, surpassing the mainstream CTI method by 2.8% and 4.3%, and reaches 47.1% mIoU on COCO2014, significantly outperforming baselines such as CPAL. Experiments show that FFCLIP substantially improves segmentation accuracy under weak supervision while keeping computational overhead low (6M additional parameters and a peak GPU memory footprint of 6.2 GB), offering a new direction for combining multi-modal learning with weakly supervised segmentation. Code link: https://github.com/xuwudang/FFCLIP
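
The repository linked above is the authoritative implementation; the following is only a minimal PyTorch sketch of how the three modules named in the abstract (PPA, RCM, WFF) could slot into a frozen-CLIP pseudo-label pipeline. The module internals, the tensor shapes, the `generate_pseudo_labels` helper, and the 0.4 confidence threshold are illustrative assumptions, not the authors' method.

```python
# Minimal sketch (assumptions, not the authors' implementation) of how the three
# modules named in the abstract could combine CLIP image and text features into
# pseudo-labels. Real CLIP encoders are replaced by pre-computed token tensors.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PanoramicPerceptionAttention(nn.Module):
    """PPA (assumed form): cross-attention from image patch tokens to class-text
    tokens, aimed at fine-grained image-text semantic alignment."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, txt_tokens):
        # img_tokens: (B, N, D) patch features; txt_tokens: (B, C, D) class-text features
        attended, _ = self.attn(img_tokens, txt_tokens, txt_tokens)
        return self.norm(img_tokens + attended)


class RectangularCalibrationModule(nn.Module):
    """RCM (assumed form): horizontal/vertical strip pooling that re-weights the
    spatial feature map to sharpen local boundary responses."""

    def __init__(self, dim: int):
        super().__init__()
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))  # average over width  -> (B, D, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))  # average over height -> (B, D, 1, W)
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, feat):
        # feat: (B, D, H, W) spatial feature map
        gate = torch.sigmoid(self.proj(self.pool_h(feat) + self.pool_w(feat)))
        return feat * gate


class WeightedCrossModalFusion(nn.Module):
    """WFF (assumed form): learnable convex combination of the aligned and
    calibrated feature streams."""

    def __init__(self):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.0))

    def forward(self, aligned, calibrated):
        a = torch.sigmoid(self.alpha)
        return a * aligned + (1.0 - a) * calibrated


def generate_pseudo_labels(img_tokens, txt_embeds, hw, ppa, rcm, wff, threshold=0.4):
    """Toy pseudo-label generation: fuse the two streams, score every patch against
    every class-text embedding, and keep only confident predictions."""
    B, N, D = img_tokens.shape
    H, W = hw
    aligned = ppa(img_tokens, txt_embeds)                          # (B, N, D)
    spatial = img_tokens.transpose(1, 2).reshape(B, D, H, W)
    calibrated = rcm(spatial).flatten(2).transpose(1, 2)           # (B, N, D)
    fused = wff(aligned, calibrated)                               # (B, N, D)
    cams = torch.einsum("bnd,bcd->bcn",
                        F.normalize(fused, dim=-1),
                        F.normalize(txt_embeds, dim=-1))           # (B, C, N)
    cams = cams.reshape(B, -1, H, W)
    labels = cams.argmax(dim=1)                                    # (B, H, W)
    labels[cams.amax(dim=1) < threshold] = 255                     # ignore uncertain pixels
    return labels


if __name__ == "__main__":
    # Dummy shapes: 2 images, 14x14 patch grid, 512-dim CLIP-like features, 20 classes.
    B, H, W, D, C = 2, 14, 14, 512, 20
    img_tokens = torch.randn(B, H * W, D)
    txt_embeds = torch.randn(B, C, D)
    ppa = PanoramicPerceptionAttention(D)
    rcm = RectangularCalibrationModule(D)
    wff = WeightedCrossModalFusion()
    print(generate_pseudo_labels(img_tokens, txt_embeds, (H, W), ppa, rcm, wff).shape)  # torch.Size([2, 14, 14])
```

In typical multi-stage WSSS pipelines, the thresholded activation maps would be further refined before training the final segmentation network; the linked repository contains the authors' actual procedure.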