Bidirectional Spatio-Temporal Feature Learning for Human Depth Estimation

doi:10.19678/j.issn.1000-3428.0069157

Abstract

To improve the accuracy of human depth image prediction, a video-based human depth image estimation method called BiSTNet is proposed. To fully mine three-dimensional (3D) information from videos, a bidirectional spatio-temporal feature learning model is introduced. The model learns features along two sequence directions, past frames and future frames, and employs a bidirectional spatio-temporal feature attention model to strengthen the influence of effective frames. A multi-scale feature fusion prediction module then fuses the bidirectional spatio-temporal features with spatial features to predict accurate depth images with rich local geometric details, so that the 3D models reconstructed from the predicted depth images are more realistic. During training, constraints on the relative order of human joints and a bidirectional sequence self-supervised learning strategy improve prediction accuracy while reducing reliance on supervised data. Experimental results show that BiSTNet not only reduces the error of the predicted depth images but also produces depth images with rich detail.
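The idea of learning from past and future frames and then weighting frames by attention can be illustrated with a small sketch. This is not the BiSTNet implementation: the exponential-average recurrence stands in for whatever learned recurrent unit the paper uses, the dot-product attention is one common choice, and the names `recurrent_pass`, `bidirectional_attention`, and `decay` are hypothetical.

```python
import math

def recurrent_pass(frames, decay=0.5):
    """Summarize frames in the given order with a simple running average.

    Assumption: a stand-in for a learned recurrent unit, not the paper's module.
    """
    state = [0.0] * len(frames[0])
    outputs = []
    for x in frames:
        state = [decay * s + (1.0 - decay) * v for s, v in zip(state, x)]
        outputs.append(state)
    return outputs

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bidirectional_attention(frames, target, decay=0.5):
    """Fuse per-frame features from both directions for one target frame."""
    # Forward pass sees past frames; backward pass sees future frames.
    fwd = recurrent_pass(frames, decay)
    bwd = recurrent_pass(frames[::-1], decay)[::-1]
    # Concatenate the two directional features for each frame.
    feats = [f + b for f, b in zip(fwd, bwd)]
    # Scaled dot-product scores of every frame against the target frame.
    query = feats[target]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, f)) / scale for f in feats]
    weights = softmax(scores)
    # Attention-weighted sum of the bidirectional features.
    fused = [sum(w * f[d] for w, f in zip(weights, feats))
             for d in range(len(query))]
    return weights, fused

# Three toy frames with two-dimensional features; the middle frame is the target.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, fused = bidirectional_attention(frames, target=1)
```

The returned `weights` sum to one and indicate how strongly each frame contributes to the fused feature of the target frame, which is the role the abstract assigns to the attention model.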

Key words: bidirectional spatio-temporal feature, multi-scale feature fusion, bidirectional self-supervision, depth map estimation, relative order constraints on human joints
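A relative order constraint on human joints can be expressed as a pairwise ranking loss on predicted joint depths. The sketch below uses a widely used ordinal-depth ranking form and is only an assumption about how such a constraint might look; the paper's actual loss may differ, and the function name `joint_order_loss` and the joint labels are hypothetical.

```python
import math

def joint_order_loss(pred_depth, pairs):
    """Average ranking loss over joint pairs.

    pred_depth: mapping from joint name to predicted depth.
    pairs: (i, j, r) tuples, where r = +1 if joint i should be farther than
           joint j, r = -1 if it should be closer, and r = 0 if the two
           should lie at roughly the same depth.
    Assumption: a generic ordinal ranking loss, not the paper's formulation.
    """
    total = 0.0
    for i, j, r in pairs:
        diff = pred_depth[i] - pred_depth[j]
        if r == 0:
            # Penalize any depth gap between joints labeled as equal.
            total += diff * diff
        else:
            # Small when the predicted order agrees with r, large otherwise.
            total += math.log1p(math.exp(-r * diff))
    return total / len(pairs)

pred = {"wrist": 2.0, "elbow": 1.0}
loss_consistent = joint_order_loss(pred, [("wrist", "elbow", 1)])
loss_violated = joint_order_loss(pred, [("wrist", "elbow", -1)])
```

Here `loss_consistent` is smaller than `loss_violated`, so minimizing the loss pushes the predicted depths toward the labeled joint order without requiring absolute depth supervision.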

ZHU Zibin, LI Qianlin, ZHANG Xiaoyan, HAN Shuangshuang. Bidirectional Spatio-Temporal Feature Learning for Human Depth Estimation[J]. Computer Engineering, 2025, 51(10): 258-269.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069157

https://www.ecice06.com/EN/Y2025/V51/I10/258

Figures/Tables 13

Fig.1 BiSTNet structure

Fig.2 Structure of spatial feature extraction module and multi-scale feature fusion and prediction module

Fig.3 Structure of bidirectional spatio-temporal feature extraction module

Fig.4 Comparison of depth prediction results using different methods

Fig.5 Depth map prediction results of models based on unidirectional and bidirectional spatio-temporal feature extraction

Fig.6 Depth map prediction results of models with and without attention

Fig.7 Impact of different self-supervised parameters on the ME, RMSE, and SIM evaluation metrics

Fig.8 Application examples on in-the-wild images

Fig.9 Application examples in occluded environments

