
Computer Engineering (计算机工程) ›› 2023, Vol. 49 ›› Issue (7): 161-168. doi: 10.19678/j.issn.1000-3428.0064969

• Graphics and Image Processing •

Stereo Matching Algorithm Combining Multiple Attention and Iterative Optimization

Hua HOU, Hongyang GUO*, Chaona DAI, Junhui LI

  1. School of Information and Electrical Engineering, Hebei University of Engineering, Handan 056000, Hebei, China
  • Received: 2022-06-13  Online: 2023-07-15  Published: 2023-07-14
  • Corresponding author: Hongyang GUO
  • About the authors:

    Hua HOU (born 1980), female, professor, Ph.D.; her main research interests include computer vision, mobile communication, cognitive radio, and the Internet of Things

    Chaona DAI, M.S. candidate

    Junhui LI, M.S. candidate

  • Funding:
    Natural Science Foundation of Hebei Province (F2022402001)

Abstract:

Stereo matching algorithms based on deep learning offer high accuracy, but they suffer from slow processing speed, high video memory consumption, and a restricted disparity search range. To address these issues, this paper introduces a stereo matching algorithm that combines multiple attention mechanisms with iterative optimization. A cross-attention module built on the Transformer structure aggregates global and local feature information between the left and right images and captures long-distance dependencies along the epipolar direction, so that global features from both images are fused more effectively and a disparity map is generated without a preset disparity range limit. An iterative residual optimization module is designed that uses the cross-attention module to generate a dense, range-unconstrained disparity map at the smallest scale, then iteratively restores the disparity resolution, builds sparse cost volumes, and estimates disparity residual maps through disparity regression, retaining the advantages of the dense disparity map while reducing computational cost and memory consumption. Furthermore, a context attention module is developed to capture dynamic and static context feature information, reduce the number of floating-point operations and parameters, and provide rich salient features for cost aggregation. Experimental results on the SceneFlow, KITTI2012, and KITTI2015 datasets show that, compared with the mainstream algorithm AANet, the proposed algorithm improves accuracy by 0.46%, 0.47%, and 0.25%, respectively, while inference speed decreases by an average of 50%.
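The following is a minimal PyTorch sketch of the epipolar cross-attention and soft-argmax disparity regression idea summarized in the abstract. It is an illustrative assumption, not the authors' implementation: all module names, tensor shapes, the clamping step, and the use of nn.MultiheadAttention are hypothetical stand-ins.

```python
import torch
import torch.nn as nn


class EpipolarCrossAttention(nn.Module):
    """Cross-attention between rectified left/right feature maps.

    For rectified stereo pairs, corresponding pixels lie on the same row,
    so attention is computed independently along each epipolar line.
    """

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)

    def forward(self, feat_left, feat_right):
        # feat_*: (B, C, H, W) features from a shared backbone (assumed).
        b, c, h, w = feat_left.shape
        # Fold rows into the batch so each epipolar line is attended separately.
        q = feat_left.permute(0, 2, 3, 1).reshape(b * h, w, c)
        kv = feat_right.permute(0, 2, 3, 1).reshape(b * h, w, c)
        # Left-image pixels query right-image positions on the same row.
        out, weights = self.attn(q, kv, kv)  # out: (B*H, W, C), weights: (B*H, W, W)
        fused = out.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return fused, weights


def soft_argmax_disparity(weights, batch, height):
    """Regress disparity from row-wise matching weights (soft-argmax).

    The disparity of a left pixel at column x is x minus the expected matching
    column in the right image, so no fixed disparity search range is required.
    """
    _, w, _ = weights.shape
    cols = torch.arange(w, dtype=weights.dtype, device=weights.device)
    expected_right_col = (weights * cols).sum(dim=-1)        # (B*H, W)
    disparity = cols.unsqueeze(0) - expected_right_col       # x_left - x_right
    return disparity.clamp(min=0).reshape(batch, height, w)  # (B, H, W)


if __name__ == "__main__":
    # Tiny smoke test with random features standing in for backbone outputs.
    b, c, h, w = 1, 32, 8, 16
    left, right = torch.randn(b, c, h, w), torch.randn(b, c, h, w)
    module = EpipolarCrossAttention(channels=c)
    fused, weights = module(left, right)
    disp = soft_argmax_disparity(weights, b, h)
    print(fused.shape, disp.shape)  # (1, 32, 8, 16) and (1, 8, 16)
```

Folding the image rows into the batch dimension restricts attention to a single epipolar line, which is what removes the need for a fixed disparity search range in this sketch; the paper's iterative residual optimization and context attention modules would operate on top of such an initial, range-unconstrained estimate.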

Key words: stereo matching, depth estimation, attention, Transformer structure, residual