
Computer Engineering

   

Self-supervised Monocular Depth Estimation by Fusing Graph Neural Network and Laplacian Pyramid

  

  • Published: 2026-01-30


Abstract: Existing self-supervised monocular depth estimation models typically use convolutional neural networks and Transformers for feature encoding and decoding. However, these architectures struggle to capture the geometric features of irregular and complex objects in a scene flexibly and efficiently. Moreover, as the network deepens, high-frequency edge information in the image is progressively attenuated, so the depth features lack crucial edge details and model performance degrades. To address these issues, this paper proposes a self-supervised monocular depth estimation model that fuses a graph neural network with a Laplacian pyramid. First, a Vision Graph Neural Network (ViG) is employed as the backbone to model the global topological relationships within the scene. Second, a Laplacian Residual Fusion module is designed: it concatenates the Laplacian pyramid residuals with the encoded and decoded features along the spatial dimension and then applies channel attention to recalibrate the channel weights, fusing the Laplacian pyramid efficiently in both the spatial and channel dimensions and thereby enhancing the edge details of the decoded features. Finally, an Edge-Guided Graph Reasoning module is proposed, which treats pixels on object boundaries as graph nodes and performs explicit graph reasoning over them to improve the quality of depth estimation in these boundary regions. Experimental results on the KITTI dataset show that, compared with the baseline method Monodepth2, the proposed model reduces the absolute relative error (Abs Rel) by 12.2% and the squared relative error (Sq Rel) by 21.4%, while the accuracy at threshold 1.25 reaches 89.6%. Results on the Make3D dataset further demonstrate that the model achieves good depth estimation performance on unseen scenes.
Qualitative visualizations likewise indicate that the proposed model predicts depth maps with sharper edges and richer detail.
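The Laplacian pyramid residual used by the fusion module is the high-frequency band obtained by subtracting a blurred, downsampled-then-re-expanded copy of a signal from the original. The following minimal 1-D Python sketch is illustrative only: the paper's module operates on 2-D feature maps, and the simple pair-averaging filter and nearest-neighbour upsampling here are assumptions for clarity, not the authors' implementation.

```python
# 1-D sketch of a one-level Laplacian pyramid residual (illustrative only;
# the paper's Laplacian Residual Fusion module works on 2-D image features).

def downsample(x):
    # Average adjacent pairs: a crude low-pass filter plus decimation.
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def upsample(x, n):
    # Nearest-neighbour expansion back to the original length n.
    out = []
    for v in x:
        out.extend([v, v])
    return out[:n]

def laplacian_residual(x):
    # Residual = signal minus its blurred/decimated/re-expanded copy;
    # it keeps the high-frequency (edge) content lost by downsampling.
    coarse = upsample(downsample(x), len(x))
    return [a - b for a, b in zip(x, coarse)]

signal = [0, 0, 0, 1, 1, 1, 1, 1]  # a step edge
print(laplacian_residual(signal))  # → [0, 0, -0.5, 0.5, 0, 0, 0, 0]
```

Note that the residual is nonzero only around the step edge, while flat regions cancel to zero: this is exactly the edge detail that deep encoders progressively attenuate, and that the fusion module injects back into the decoded features.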
