
Computer Engineering ›› 2021, Vol. 47 ›› Issue (2): 268-278. doi: 10.19678/j.issn.1000-3428.0059554

• Graphics and Image Processing •

Monocular Image Depth Understanding Based on Scene Modality Depth Understanding Network

CHEN Yang, LI Dawei

  1. College of Information Sciences and Technology, Donghua University, Shanghai 201620, China
  • Received: 2020-09-25  Revised: 2020-10-28  Online: 2021-02-15  Published: 2020-11-05
  • About the authors: CHEN Yang (b. 1994), male, M.S. candidate; research interests include computer vision and 3D image reconstruction. LI Dawei (corresponding author), Associate Professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61603089); Natural Science Foundation of Shanghai (20ZR1400800).




Abstract: The monocular depth images produced by image processing methods based on Deep Convolutional Neural Networks (DCNN) are of far higher quality than those of traditional image processing methods. However, such methods are prone to error accumulation when training on useless features, and the accuracy of continuous depth prediction by direct regression is low, leading to imprecise extraction of image depth information, blurred object edges, and missing image details. This paper proposes a Scene Modality Depth Understanding Network (SMDUN) for monocular color images. A network model built on a stacked-hourglass backbone is established: through repeated bottom-up and top-down feature extraction, low-level texture and high-level semantic features are fused. At each training stage, discrete depth labels are combined with ground-truth depth images to reduce the difficulty of depth understanding, and an error-correction submodule and a maximum-likelihood-decoding optimization submodule are inserted to extract depth features accurately. Experimental results show that the network obtains more accurate depth information: its Absolute Relative Error (AbsRel) on the NYUv2 dataset is 0.72% lower than that of the ACAN network, and its Mean Squared Relative Error (MSqRel) on the KITTI dataset is 41.28% lower than that of the GASDA network. Compared with DORN and other deep networks, its predicted depth images contain more detail and clearer object contours.
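The abstract's core formulation, combining discrete depth labels (ordinal regression) with a maximum-likelihood-style decoding of the predicted depth distribution, can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's implementation: it assumes log-space depth discretization (as popularized by DORN) and a soft probability-weighted decoding; the `d_min`, `d_max`, and `n_bins` values are illustrative, not the paper's settings.

```python
import numpy as np

def depth_to_ordinal_labels(depth, d_min=0.7, d_max=10.0, n_bins=80):
    """Discretize continuous depth into ordinal bin indices using
    log-space (spacing-increasing) thresholds, so that near-range
    depth keeps finer resolution than far-range depth."""
    edges = np.exp(np.linspace(np.log(d_min), np.log(d_max), n_bins + 1))
    d = np.clip(depth, d_min, d_max)
    labels = np.clip(np.digitize(d, edges) - 1, 0, n_bins - 1)
    return labels, edges

def ordinal_decode(probs, edges):
    """Decode predicted per-bin probabilities back to metric depth by
    taking the probability-weighted mean of geometric bin centers
    (a soft, maximum-likelihood-style decoding of the distribution)."""
    centers = np.sqrt(edges[:-1] * edges[1:])  # geometric centers of bins
    return (probs * centers).sum(axis=-1) / probs.sum(axis=-1)
```

In this sketch, training would target `labels` with an ordinal classification loss, and `ordinal_decode` would convert the network's per-pixel bin probabilities into a continuous depth map at inference time; a peaked (one-hot) distribution decodes exactly to its bin center.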

Key words: monocular depth understanding, scene modality labeling, ordinal regression, error correction, maximum likelihood decoding
