Adaptive Window Stereo Matching Algorithm with Lightweight Transformer

doi:10.19678/j.issn.1000-3428.0067177

Abstract

Existing end-to-end stereo matching algorithms preset a fixed disparity range to reduce memory consumption and computation, which makes it difficult to balance matching accuracy against running efficiency. To address this problem, this paper proposes an adaptive window stereo matching algorithm based on a lightweight Transformer. A coordinate attention layer with linear complexity encodes positions on the low-resolution feature map, which reduces computation and makes similar features easier to distinguish. A lightweight Transformer feature description module transforms the features into context-dependent features, and a separable Multi-Head Self-Attention (MHSA) layer is introduced to reduce Transformer latency. A differentiable matching layer matches the features, and an adaptive window matching and refinement module performs sub-pixel matching and refinement, which improves matching accuracy while reducing GPU memory consumption. After disparity regression, the algorithm generates a disparity map that is not constrained by a preset disparity range. Comparative experiments on the KITTI2015, KITTI2012, and SceneFlow datasets show that the proposed algorithm is approximately 4.7 times faster than STTR, which is based on a standard Transformer, and is more memory-friendly. Compared with PSMNet, which is based on 3D convolution, the mismatching rate is reduced by 18% and the algorithm runs five times faster, achieving a better balance between speed and accuracy.

Key words: stereo matching, Transformer, adaptive window, separable self-attention mechanism, coordinate attention
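The abstract states that a differentiable matching layer followed by disparity regression yields a disparity map without a preset disparity range. One common way to obtain this behaviour is a soft-argmax over the matching scores of a whole rectified scanline. The sketch below illustrates that generic idea only; it is not the authors' LTAWNet implementation, and the function names, the dot-product similarity, and the `temperature` parameter are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def regress_disparity(left_row, right_row, temperature=0.1):
    """Soft-argmax disparity for every pixel of one rectified scanline.

    left_row, right_row: lists of feature vectors, one per image column.
    Returns sub-pixel disparities d = x_left - E[x_right].
    Hypothetical helper, not taken from the paper.
    """
    disparities = []
    for x_left, f_left in enumerate(left_row):
        # Similarity against every column of the right scanline, so no
        # maximum disparity has to be fixed in advance.
        scores = [
            sum(a * b for a, b in zip(f_left, f_right)) / temperature
            for f_right in right_row
        ]
        probs = softmax(scores)
        # Expected matching column under the soft assignment.
        expected_x_right = sum(p * x for x, p in enumerate(probs))
        disparities.append(x_left - expected_x_right)
    return disparities


# Toy scanline: one-hot features, every left pixel from column 1 onward
# is shifted by one column relative to its match in the right image.
left = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
right = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
print(regress_disparity(left, right))
```

Because the expectation is taken over all columns, the output is continuous and differentiable with respect to the features. The cost is a score matrix that grows with the square of the scanline width, which is the kind of overhead the abstract's adaptive window and lightweight attention are meant to reduce.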

Zhengjia WANG, Feifei HU, Chengjuan ZHANG, Zhuo LEI, Tao HE. Adaptive Window Stereo Matching Algorithm with Lightweight Transformer[J]. Computer Engineering, 2024, 50(2): 256-265.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067177

http://www.ecice06.com/EN/Y2024/V50/I2/256

Figures/Tables 12

Fig.1 Overall framework of LTAWNet

Fig.2 Feature extraction network structure and position encoding layer structure

Fig.3 WT structure and MHSA layer structure

Fig.4 Comparison of MHSAs

Fig.5 Matching refinement process and formation of adaptive search window

Fig.6 Qualitative comparison results on KITTI2015 dataset

Fig.7 Qualitative comparison results on KITTI2012 dataset

Fig.8 Qualitative comparison results on SceneFlow dataset
