Computer Engineering, 2025, Vol. 51, Issue (10): 258-269. doi: 10.19678/j.issn.1000-3428.0069157

• Graphics and Image Processing •

Bidirectional Spatio-Temporal Feature Learning for Human Depth Estimation

ZHU Zibin1, LI Qianlin1, ZHANG Xiaoyan1,*, HAN Shuangshuang2

  1. College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, Guangdong, China
  2. School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China
  • Received: 2024-01-03 Revised: 2024-05-10 Online: 2025-10-15 Published: 2024-08-06
  • Contact: ZHANG Xiaoyan

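  • Supported by:
    Guangdong Basic and Applied Basic Research Foundation (2020B1515120047); Natural Science Foundation of Guangdong Province (2021A1515011632); Natural Science Foundation of Guangdong Province (2021A1515012014)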

Abstract:

To improve the accuracy of human depth image prediction, a video-based human depth image estimation method called BiSTNet is proposed. To fully exploit the three-dimensional (3D) information in videos, a bidirectional spatio-temporal feature learning model learns features along two sequence directions, from past frames and from future frames, and a bidirectional spatio-temporal feature attention model strengthens the influence of effective frames. In addition, a multi-scale feature fusion prediction module fuses the bidirectional spatio-temporal features with spatial features to predict accurate depth images with rich local geometric details, so that the 3D models reconstructed from the predicted depth images are more realistic. During training, constraints on the relative order of human joints and a bidirectional sequence self-supervised learning strategy improve prediction accuracy while reducing reliance on supervised data. Experimental results show that BiSTNet not only reduces the error of the predicted depth images but also produces depth images with rich detail.

Key words: bidirectional spatio-temporal feature, multi-scale feature fusion, bidirectional self-supervision, depth map estimation, relative order constraints on human joints
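
The relative joint order constraint named in the abstract can be illustrated with a pairwise ranking loss. The sketch below is only an illustration and not the authors' formulation: the abstract does not state the loss, so the function name `relative_order_loss`, the joint names, and the softplus ranking form (a common choice in ordinal depth supervision) are all assumptions.

```python
import math


def relative_order_loss(depths, pairs):
    """Average pairwise ranking loss over annotated joint pairs.

    depths: dict mapping a joint name to the predicted depth at that joint
            (hypothetical input format, smaller value = closer to camera).
    pairs:  list of (joint_i, joint_j, r) where r = +1 if joint_i is closer
            than joint_j, r = -1 if it is farther, and r = 0 if the two
            joints are at roughly the same depth.
    """
    if not pairs:
        return 0.0
    total = 0.0
    for joint_i, joint_j, r in pairs:
        diff = depths[joint_j] - depths[joint_i]
        if r == 0:
            # Equal-depth pairs: penalise any depth difference.
            total += diff * diff
        else:
            # Ordered pairs: softplus(-r * diff), written in a
            # numerically stable form. Small when the predicted
            # order agrees with the annotation, large otherwise.
            x = -r * diff
            total += max(x, 0.0) + math.log1p(math.exp(-abs(x)))
    return total / len(pairs)


# Hypothetical example: the wrist is predicted closer than the elbow.
predicted = {"wrist": 1.0, "elbow": 2.0}
agree = relative_order_loss(predicted, [("wrist", "elbow", 1)])
violate = relative_order_loss(predicted, [("wrist", "elbow", -1)])
```

Under these assumptions, `agree` is smaller than `violate`, so minimising the loss pushes the predicted depths toward the annotated joint order without requiring metric ground-truth depth.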
