Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 222-229, 238. doi: 10.19678/j.issn.1000-3428.0065885

• Graphics and Image Processing •

Real-Time Semantic Segmentation Network Based on Attention Mechanism and Multi-Scale Pooling

Zhuo WANG, Shaojun QU*

  1. College of Information Science and Engineering, Hunan Normal University, Changsha 410081, China
  • Received: 2022-09-30 Online: 2023-10-15 Published: 2023-01-03
  • Contact: Shaojun QU
  • About the author:

    Zhuo WANG (born 2000), female, master's student, CCF member; main research interests: semantic segmentation and computer vision

  • Supported by:
    National Natural Science Foundation of China (12071126)

Abstract:

Existing semantic segmentation algorithms achieve high accuracy, but their speed is too low to meet real-time requirements. To increase segmentation speed while maintaining high accuracy, a new real-time semantic segmentation network is proposed. First, a Fusion Channel Attention Module (FCAM) is designed. Max pooling and average pooling are applied to capture global features, and the pooled feature maps are concatenated, convolved, and reshaped to obtain the weight of each channel. Matrix multiplication of the original feature map and the channel weights then yields the fused channel weights. Finally, element-wise multiplication is performed between the fused channel weights and the original feature map, so that the channel weights are effectively integrated with the original feature map. In addition, a lightweight pyramid scene parsing module is proposed. It uses multi-scale pooling to fully capture the features of targets at multiple scales and, relative to the original pyramid scene parsing module, reduces the number of channels in the pooled feature maps, which lowers the amount of computation. The pooled feature maps are concatenated, and the input feature map is used to guide the concatenated feature map so that high-level and low-level feature maps are fused effectively. Experiments on the public Cityscapes image dataset show that the network achieves an accuracy of 74.6% on the validation set and 73.8% on the test set at a segmentation speed of 60.6 frames/s, outperforming networks such as ICNet and DFANet-A.

Key words: semantic segmentation, global feature, attention mechanism, pyramid scene parsing, multi-scale pooling
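
The two modules named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no tensor shapes, so the multiplication order in `fcam`, the `(C, 2C)` weight matrix standing in for the convolution, the bin sizes `(1, 2, 3, 6)` (taken from the original pyramid scene parsing module), the nearest-neighbour upsampling, and every function and parameter name are assumptions made for illustration. The step in which the input feature map guides the concatenated map is left out because the abstract does not say how it is done.

```python
import numpy as np


def fcam(x, conv_w):
    """Sketch of the fusion channel attention steps for one feature map.

    x:      feature map of shape (C, H, W)
    conv_w: hypothetical 1x1-convolution weights of shape (C, 2C)
    """
    c, h, w = x.shape
    flat = x.reshape(c, h * w)                        # (C, N), N = H * W
    # Global max and average pooling: one value per channel each.
    pooled = np.concatenate([flat.max(axis=1), flat.mean(axis=1)])  # (2C,)
    # A 1x1 convolution on a 1x1 map is a matrix-vector product;
    # the result is reshaped to a row of channel weights (assumed shape).
    weights = (conv_w @ pooled).reshape(1, c)         # (1, C)
    # Matrix multiplication with the original map (assumed ordering).
    fused = weights @ flat                            # (1, N)
    # Element-wise multiplication, broadcast over the channel axis.
    return (flat * fused).reshape(c, h, w)


def avg_pool_to_bins(x, bins):
    """Average-pool (C, H, W) to (C, bins, bins); bins must divide H and W."""
    c, h, w = x.shape
    return x.reshape(c, bins, h // bins, bins, w // bins).mean(axis=(2, 4))


def upsample_nearest(x, h, w):
    """Nearest-neighbour upsampling of (C, b, b) back to (C, H, W)."""
    b = x.shape[1]
    return x.repeat(h // b, axis=1).repeat(w // b, axis=2)


def light_ppm(x, reduce_ws, bin_sizes=(1, 2, 3, 6)):
    """Multi-scale pooling with channel reduction, then concatenation.

    reduce_ws: one hypothetical (C_r, C) 1x1-convolution matrix per bin size;
               choosing a small C_r is what lowers the computation.
    """
    _, h, w = x.shape
    branches = []
    for bins, rw in zip(bin_sizes, reduce_ws):
        pooled = avg_pool_to_bins(x, bins)               # (C, bins, bins)
        reduced = np.einsum("oc,chw->ohw", rw, pooled)   # (C_r, bins, bins)
        branches.append(upsample_nearest(reduced, h, w))
    return np.concatenate(branches, axis=0)              # (4 * C_r, H, W)


rng = np.random.default_rng(0)
x = rng.standard_normal((8, 12, 12))
print(fcam(x, rng.standard_normal((8, 16))).shape)       # (8, 12, 12)
reduce_ws = [rng.standard_normal((2, 8)) for _ in range(4)]
print(light_ppm(x, reduce_ws).shape)                     # (8, 12, 12)
```

In this sketch each pooled branch is reduced from 8 channels to 2 before upsampling, so the four concatenated branches together have as many channels as the input. This is one way to read "reducing the number of channels in the pooled feature maps"; the actual reduction ratio used in the paper is not stated in the abstract.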