Computer Engineering ›› 2025, Vol. 51 ›› Issue (2): 126-138. doi: 10.19678/j.issn.1000-3428.0069013

• Artificial Intelligence and Pattern Recognition •

Mask Feature Fusion: A New Paradigm for Instance Segmentation

LI Weikang, ZHANG Siquan*

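  1. Logistics Engineering College, Shanghai Maritime University, Shanghai 200135, China
  • Received: 2023-12-26 Published: 2025-02-15 Online: 2024-05-20
  • Corresponding author: ZHANG Siquan
  • Funding:
    National Natural Science Foundation of China (51175321)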

Mask Feature Fusion: A New Paradigm for Instance Segmentation

LI Weikang, ZHANG Siquan*

  1. Logistics Engineering College, Shanghai Maritime University, Shanghai 200135, China
  • Received: 2023-12-26 Published: 2025-02-15 Online: 2024-05-20
  • Contact: ZHANG Siquan

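Abstract:

Instance segmentation is one of the basic tasks of visual scene understanding. Existing algorithms share a certain degree of similarity; by sorting out what they have in common and where they differ, this paper abstracts a novel instance segmentation paradigm: Mask Feature Fusion (MFF). The paradigm divides instance segmentation into three modules: semantics-independent mask feature extraction, semantics-dependent sequence extraction, and fusion of the sequence features with the mask features. Two optimizations are then proposed based on the structural characteristics of the new paradigm. First, a non-local global bias is designed to strengthen the backbone network's attention to global information, so that the mask feature extraction module can capture global information in the shallow layers of the network and the inherent dataset bias introduced by pretrained weights is eliminated. Second, experiments showed that some Transformer models suffer from unstable query vectors early in training: the Region of Interest (ROI) of most query vectors drifts after each cross-attention operation. To address this query vector drift, a denoising training method is proposed for the sequence extraction module. It keeps the attention of each query vector on the same region from the early stage of training, which accelerates the convergence of the Transformer decoder and improves model accuracy with all other parameter settings unchanged. Experimental results confirm the effectiveness of these improvements. On the instance segmentation task of the MS-COCO2017 dataset, the model with the new improvements achieves a notable gain of 5.0% in mask mean Average Precision (mAP) over the base model of the MFF paradigm.

Key words: instance segmentation paradigm, Mask Feature Fusion (MFF), non-local global bias, denoising training, query vector drift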

Abstract:

Instance segmentation is a fundamental task in visual scene understanding. Existing algorithms have much in common, and by analyzing their similarities and differences, this paper abstracts a novel instance segmentation paradigm called Mask Feature Fusion (MFF). The paradigm divides the instance segmentation task into three modules: extraction of semantically independent mask features, extraction of semantically related sequences, and fusion of the sequence features with the mask features. Building on the structural characteristics of MFF, two optimizations are proposed. First, a non-local global bias is designed to strengthen the backbone network's focus on global information. This allows the mask feature extraction module to access global information in the shallow layers of the network and mitigates the inherent dataset bias introduced by pretrained weights. Second, experiments show that the query vectors of some Transformer models are unstable in the early stages of training: the Regions of Interest (ROIs) of most query vectors shift after each cross-attention operation. To address this issue, a denoising training method is introduced for the sequence extraction module. The method keeps the attention of the query vectors on the same region from the early stages of training, thereby accelerating the convergence of the Transformer decoder and improving model precision under otherwise identical parameter configurations. Experimental results demonstrate the effectiveness of these improvements. In the instance segmentation task on the MS-COCO2017 dataset, the model with the new improvements achieves a notable increase of 5.0% in mask mean Average Precision (mAP) over the baseline model of the MFF paradigm.

Key words: instance segmentation paradigm, Mask Feature Fusion (MFF), non-local global bias, denoising training, query vector shifting
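
The three-module split described in the abstract can be illustrated with a toy example of the final fusion step. The abstract does not specify the fusion operator, so the sketch below assumes a dot product between each per-instance sequence feature and each per-pixel mask feature, followed by a sigmoid and a threshold, as used in several query-based segmentation models. The function name `fuse_masks`, the threshold, and the toy feature values are hypothetical and are not taken from the paper.

```python
import math
from typing import List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Logistic function mapping a fusion score to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_masks(
    sequence_features: List[Vector],
    mask_features: List[List[Vector]],
    threshold: float = 0.5,
) -> List[List[List[int]]]:
    """Fuse N sequence features (length C) with an H x W grid of mask features (length C).

    Assumed fusion rule (not confirmed by the source): the score of pixel (y, x)
    for instance n is the dot product of sequence feature n with the mask feature
    at (y, x). Returns N binary masks of size H x W.
    """
    masks = []
    for query in sequence_features:
        mask = [
            [
                1 if sigmoid(sum(q * f for q, f in zip(query, pixel))) > threshold else 0
                for pixel in row
            ]
            for row in mask_features
        ]
        masks.append(mask)
    return masks


# Toy 2 x 2 image with 2-channel mask features (semantics-independent module output).
mask_features = [
    [[1.0, 0.0], [1.0, 0.0]],  # top row responds on channel 0
    [[0.0, 1.0], [0.0, 1.0]],  # bottom row responds on channel 1
]

# Two sequence features (semantics-related module output), one per instance.
sequence_features = [
    [4.0, -4.0],  # instance 0 selects channel 0
    [-4.0, 4.0],  # instance 1 selects channel 1
]

masks = fuse_masks(sequence_features, mask_features)
for index, mask in enumerate(masks):
    print(f"instance {index}: {mask}")
```

In this toy setup each sequence feature acts as a per-instance filter over the shared mask features, which is why the mask features themselves can stay independent of class semantics.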