Mask Feature Fusion: New Paradigm of Instance Segmentation

doi:10.19678/j.issn.1000-3428.0069013

Abstract

Instance segmentation is a fundamental task in visual scene understanding. Existing algorithms share a number of structural similarities. By analyzing the commonalities and differences among them, this paper abstracts a new instance segmentation paradigm, Mask Feature Fusion (MFF). The paradigm divides instance segmentation into three modules: semantics-independent mask feature extraction, semantics-related sequence extraction, and fusion of the sequence features with the mask features. Building on the structure of MFF, two optimizations are proposed. First, a non-local global bias is designed to strengthen the backbone network's attention to global information, so that the mask feature extraction module can capture global information in shallow network layers while mitigating the dataset-inherent bias introduced by pretrained weights. Second, experiments show that in some Transformer models the query vectors are unstable early in training: the Regions of Interest (ROIs) of most query vectors shift after each cross-attention operation. To address this query vector shifting, a denoising training method is introduced for the sequence extraction module. It keeps the attention of each query vector on the same region from the early stages of training, which accelerates the convergence of the Transformer decoder and improves accuracy under otherwise identical parameter configurations. Experimental results support the effectiveness of both improvements. On the instance segmentation task of the MS-COCO2017 dataset, adding them to the baseline model of the MFF paradigm raises the mask mean Average Precision (mAP) by 5.0%.

Keywords: instance segmentation paradigm, Mask Feature Fusion (MFF), non-local global bias, denoising training, query vector shifting
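The abstract names the fusion of sequence features with mask features as the third MFF module but does not say how the two are combined. The sketch below illustrates one common realization in query-based segmentation, a dot product between each query embedding and every per-pixel mask feature followed by a sigmoid. This choice is an assumption made for illustration, not the authors' confirmed design; `fuse_masks`, the toy feature map, and the two queries are all hypothetical.

```python
import math
from typing import List

Vector = List[float]
FeatureMap = List[List[Vector]]  # H x W x C, semantics-independent mask features


def fuse_masks(queries: List[Vector], mask_features: FeatureMap) -> List[List[List[float]]]:
    """Return one H x W soft mask per query.

    Assumed fusion rule: sigmoid(dot(query, pixel_feature)) at every pixel.
    """
    masks = []
    for q in queries:
        mask = [
            [
                1.0 / (1.0 + math.exp(-sum(qc * fc for qc, fc in zip(q, pixel))))
                for pixel in row
            ]
            for row in mask_features
        ]
        masks.append(mask)
    return masks


# Toy input: a 2 x 2 feature map with 2 channels.
# The top row responds on channel 0, the bottom row on channel 1.
features = [
    [[4.0, 0.0], [4.0, 0.0]],
    [[0.0, 4.0], [0.0, 4.0]],
]
# Two hypothetical query embeddings, one per instance.
queries = [[1.0, -1.0], [-1.0, 1.0]]

soft_masks = fuse_masks(queries, features)
# Threshold the soft masks at 0.5 to obtain binary instance masks.
binary_masks = [[[int(p > 0.5) for p in row] for row in m] for m in soft_masks]
```

In this toy case the first query selects the top row and the second selects the bottom row, which shows how semantics-related queries can pick instances out of a shared, semantics-independent feature map.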

LI Weikang, ZHANG Siquan. Mask Feature Fusion: New Paradigm of Instance Segmentation[J]. Computer Engineering, 2025, 51(2): 126-138.

https://www.ecice06.com/CN/Y2025/V51/I2/126

Figures/Tables: 14

Fig.1 Illustration of the MFF paradigm

Fig.2 Sequence encoder

Fig.3 Feature fusion head

Fig.4 Illustration of the global bias

Fig.5 Visualization of query vector shifting after 2 000 training iterations

Fig.6 Denoising training

Fig.7 Inference results on the MS-COCO2017 validation set

Fig.8 Histogram of mAP on the MS-COCO2017 validation set

Fig.9 Trend of the global bias module metrics without loading pretrained weights

