
Computer Engineering, 2024, Vol. 50, Issue (4): 197-207. doi: 10.19678/j.issn.1000-3428.0067217

• Graphics and Image Processing •

Metro Platform Foreign Object Detection Based on Dual-channel Transformer

LIU Ruikang, LIU Weiming, DUAN Mengfei, XIE Wei, DAI Yuan

  1. School of Civil Engineering and Transportation, South China University of Technology, Guangzhou 510640, Guangdong, China
  • Received: 2023-03-20 Revised: 2023-07-06 Published: 2023-08-09
  • Corresponding author: LIU Ruikang, E-mail: liuruikanglin@163.com
  • Funding:
    National Key Research and Development Program of China, 13th Five-Year Plan (2016YFB1200402).

Abstract: Owing to their global self-attention, Transformers have achieved more competitive results than Convolutional Neural Networks (CNNs) in foreign object detection. However, they still suffer from high computational cost, a fixed input image patch size, and limited interaction between local and global information. To address these problems, a DualFormer model is proposed that combines a dual-channel Transformer backbone, pyramid lightweight Transformer blocks, and a channel cross-attention mechanism. The model detects foreign objects in the gap between metro platform screen doors and train doors. To address the fixed patch size, a dual-channel strategy uses two different feature extraction channels to process input image patches of different scales, which strengthens the extraction of both coarse-grained and fine-grained features and improves recognition accuracy for multiscale targets. To address the high computational cost, a pyramid lightweight Transformer block introduces cascaded convolution into the Multi-Head Self-Attention (MHSA) module and uses the dimensionality compression of the convolution to reduce the cost of the model. To address the limited interaction between local and global information, a channel cross-attention mechanism lets the extracted coarse-grained and fine-grained features interact at the channel level and optimizes the weights of local and global information in the network. Experimental results on a standardized metro foreign object detection dataset show that DualFormer, with 1.98×10^7 parameters, achieves a mean average precision of 89.7% at a detection speed of 24 frames/s, outperforming the Transformer-based detection algorithms it was compared against.

Key words: Vision Transformer (ViT), foreign object detection, dual-channel strategy, pyramid lightweight Transformer block, attention fusion
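
The abstract states that convolution inside the MHSA module compresses dimensionality to reduce computational cost, but it does not give the exact layer design. The short Python sketch below only illustrates the general idea under an assumption that is not taken from the paper: a strided convolution shrinks each spatial side of the key/value feature map by a hypothetical factor `reduction` before attention is computed. The function name `attention_cost`, the feature-map size, and the channel width are all illustrative choices.

```python
def attention_cost(num_tokens: int, dim: int, reduction: int = 1) -> int:
    """Approximate multiply-accumulate count of the two matrix products in
    self-attention (Q @ K^T and attn @ V), ignoring projections and softmax.

    `reduction` is a hypothetical stride by which a convolution shrinks each
    spatial side of the key/value map; reduction=1 means standard attention.
    """
    # Shrinking both spatial sides by `reduction` divides the number of
    # key/value tokens by reduction squared.
    kv_tokens = num_tokens // (reduction * reduction)
    qk = num_tokens * kv_tokens * dim  # cost of Q @ K^T
    av = num_tokens * kv_tokens * dim  # cost of attn @ V
    return qk + av


# Illustrative numbers only; they are not taken from the paper.
tokens = 56 * 56  # a 56x56 feature map flattened into tokens
dim = 64          # channel width per token

full = attention_cost(tokens, dim)
reduced = attention_cost(tokens, dim, reduction=4)
print(f"standard attention MACs:   {full}")
print(f"compressed attention MACs: {reduced}")
print(f"ratio: {full // reduced}")
```

Under this assumption the attention cost falls by the square of the reduction factor, because the number of queries is unchanged while the number of keys and values shrinks. The actual saving in DualFormer depends on design details that the abstract does not specify.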

CLC number: