Geometry Relationship-aware Representation Learning Model for Human Pose Estimation in Teaching Scenarios

doi:10.19678/j.issn.1000-3428.0069738

Abstract

Human pose estimation (HPE) is an important research task in computer vision and is widely used in teaching scenarios. The task still faces many challenges: accuracy drops in complex scenes with cluttered backgrounds, small human body scales, or occluded bodies, and the flexibility and variability of human poses require the model to have strong reasoning ability. To address these problems, this study proposes a geometric relationship-aware representation learning model for human pose estimation. It uses the structural information of the human body to help the model better understand the relationships between different poses, thereby improving the accuracy and robustness of complex pose predictions and enabling effective application in classroom scenarios. The model comprises four modules: channel reweighting, multi-token information interaction, limb direction construction, and adaptive loss propagation. The limb direction construction module models the geometric structure between human body joints; this input cue helps the model capture the relative positions and directions of body parts. The channel reweighting module automatically selects and emphasizes the feature information most helpful for pose estimation, improving the expressiveness of the visual features of the input image. The multi-token information interaction module, built on the Transformer encoder, enables effective interaction among image feature cues, joint coordinate cues, and limb direction cues. Finally, the adaptive loss propagation module optimizes the traditional loss function to further improve training effectiveness and model performance. The model achieves accuracies of 76.1% and 90.3% on two mainstream datasets, COCO and MPII, respectively, outperforming some existing state-of-the-art (SOTA) models and producing more accurate and reasonable predictions in complex scenarios.
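To make the limb direction cue concrete, the following is a minimal sketch of how direction vectors between connected joints might be computed from 2D joint coordinates. It is an illustration only, not the authors' implementation: the function name `limb_directions`, the particular eight limbs chosen (upper arms, forearms, thighs, and shins), and the unit-length normalization are all assumptions. Only the 17-joint COCO keypoint ordering is taken from the dataset itself.

```python
import math

# Standard COCO keypoint indices for the joints used below.
L_SHOULDER, R_SHOULDER = 5, 6
L_ELBOW, R_ELBOW = 7, 8
L_WRIST, R_WRIST = 9, 10
L_HIP, R_HIP = 11, 12
L_KNEE, R_KNEE = 13, 14
L_ANKLE, R_ANKLE = 15, 16

# ASSUMPTION: eight limbs as (parent joint, child joint) pairs.
# The abstract does not say which eight limbs the paper uses.
LIMBS = [
    (L_SHOULDER, L_ELBOW), (L_ELBOW, L_WRIST),   # left upper arm, left forearm
    (R_SHOULDER, R_ELBOW), (R_ELBOW, R_WRIST),   # right upper arm, right forearm
    (L_HIP, L_KNEE), (L_KNEE, L_ANKLE),          # left thigh, left shin
    (R_HIP, R_KNEE), (R_KNEE, R_ANKLE),          # right thigh, right shin
]


def limb_directions(joints, eps=1e-8):
    """Return one unit vector (dx, dy) per limb, pointing from parent to child.

    `joints` is a sequence of 17 (x, y) pairs in COCO keypoint order.
    A limb whose two joints coincide gets the zero vector.
    """
    directions = []
    for parent, child in LIMBS:
        dx = joints[child][0] - joints[parent][0]
        dy = joints[child][1] - joints[parent][1]
        norm = math.hypot(dx, dy)
        if norm < eps:
            directions.append((0.0, 0.0))
        else:
            directions.append((dx / norm, dy / norm))
    return directions
```

Normalizing to unit length keeps only the orientation of each limb, so the cue does not depend on how large the person appears in the image. Whether the paper normalizes in this way is not stated in the abstract.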

Keywords: human pose estimation (HPE), geometric structure cue, limb direction, Transformer, image understanding

LIU Hai, ZHU Junyan, ZHANG Zhaoli, ZHOU Qiyun, SONG Yunxiao. Geometry Relationship-aware Representation Learning Model for Human Pose Estimation in Teaching Scenarios[J]. Computer Engineering, 2025, 51(10): 97-110.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069738

https://www.ecice06.com/EN/Y2025/V51/I10/97

Figures/Tables 15

Fig.1 Three challenges in the HPE task: occlusion, complex backgrounds, and small-scale human bodies

Fig.2 Visual information, joint information, and limb direction information contained in the image

Fig.3 Geometry relationship-aware HPE network model structure

Fig.4 Implementation flow of channel reweighting mechanism

Fig.5 Accuracy results using different position encodings on the COCO dataset

Fig.6 Visualization of similarity calculation of limb grouping vector

Fig.7 Visualization of the image visual token in the Transformer encoder

Fig.8 Visualization of prediction results (seventeen human joints and eight limb directions)

Fig.9 Visualization of joint coordinates and joint connection results on the COCO dataset
