
Computer Engineering ›› 2024, Vol. 50 ›› Issue (2): 256-265. doi: 10.19678/j.issn.1000-3428.0067177

• Graphics and Image Processing •

Adaptive Window Stereo Matching Algorithm with Lightweight Transformer

Zhengjia WANG1,2, Feifei HU1,2,*, Chengjuan ZHANG1,2, Zhuo LEI1,2, Tao HE1,2

  1. Hubei Key Laboratory of Modern Manufacturing Quality Engineering, Wuhan 430068, Hubei, China
  2. School of Mechanical Engineering, Hubei University of Technology, Wuhan 430068, Hubei, China
  • Received: 2023-03-15 Online: 2024-02-15 Published: 2023-07-04
  • Contact: Feifei HU
  • Supported by:
    National Natural Science Foundation of China (51275158)

Abstract:

Existing end-to-end stereo matching algorithms preset a fixed disparity range to reduce GPU memory consumption and computation, which makes it difficult to balance matching accuracy and running efficiency. To address this problem, this paper proposes an adaptive window stereo matching algorithm based on a lightweight Transformer. A coordinate attention layer with linear complexity encodes positional information on the low-resolution feature map, which reduces computation and makes similar features easier to discriminate. A lightweight Transformer feature description module is designed to transform the features into context-dependent ones, and a separable Multi-Head Self-Attention (MHSA) layer is introduced to lighten the Transformer and reduce its latency. A differentiable matching layer matches the features, and an adaptive window matching refinement module refines the matches to sub-pixel level, which improves matching accuracy while reducing GPU memory consumption. After disparity regression, the algorithm generates a disparity map that is not limited to a preset disparity range. Comparative experiments on the KITTI2015, KITTI2012, and SceneFlow datasets show that the proposed algorithm is nearly 4.7 times faster than STTR, which is based on the standard Transformer, and is more memory-friendly. Compared with PSMNet, which is based on 3D convolution, the mismatching rate is reduced by 18% and the algorithm runs five times faster, achieving a better balance between speed and accuracy.

Key words: stereo matching, Transformer, adaptive window, separable self-attention mechanism, coordinate attention
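
The abstract mentions a differentiable matching layer followed by disparity regression that needs no preset disparity range. The paper's actual layers are not given here, so the following is only a minimal sketch of the general idea: every left pixel is soft-matched against the right scanline with a softmax, and the disparity is the left column minus the expected right column. The function names, the `temperature` parameter, the dot-product similarity, and the restriction to right columns at or left of the query pixel (a rectified-stereo assumption) are all illustrative choices, not the authors' implementation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def regress_disparity(left_row, right_row, temperature=1.0):
    """Soft-match one rectified scanline and return sub-pixel disparities.

    left_row, right_row: lists of feature vectors, one per pixel column.
    Assumption (hypothetical, for illustration): in a rectified pair the
    match of left column x lies at a right column <= x, so only those
    columns are candidates. No maximum disparity is fixed in advance.
    """
    disparities = []
    for x_left, f_left in enumerate(left_row):
        candidates = right_row[: x_left + 1]
        # Similarity of this left pixel to every candidate right pixel.
        scores = [dot(f_left, f_right) / temperature for f_right in candidates]
        weights = softmax(scores)
        # Expected matching column; differentiable in the features.
        x_right = sum(w * x for x, w in enumerate(weights))
        disparities.append(x_left - x_right)
    return disparities
```

Because the output is a softmax-weighted average of column indices rather than an arg-max, it can take fractional values, which is what makes sub-pixel refinement and gradient-based training possible in this kind of design. The full-row comparison here costs quadratic time in the row width; the abstract's adaptive window refinement is aimed at reducing exactly that kind of cost, but it is not modeled in this sketch.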