[1] ZHENG C, WU W, CHEN C, et al. Deep learning-based
human pose estimation: A survey[J]. ACM Computing
Surveys, 2023, 56(1): 1-37.
[2] 冯晓月, 宋杰. 二维人体姿态估计研究进展[J]. 计算机科学, 2020, 47(11): 128-136.
FENG X Y, SONG J. Advances in two-dimensional human pose estimation research[J]. Computer Science, 2020, 47(11): 128-136. (in Chinese)
[3] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30.
[4] DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018. (2019-05-24)[2023-06-10]. https://doi.org/10.48550/arXiv.1810.04805.
[5] RADFORD A, NARASIMHAN K, SALIMANS T, et al. Improving language understanding by generative pre-training[J]. 2018.
[6] YANG A, PAN J, LIN J, et al. Chinese CLIP: Contrastive vision-language pretraining in Chinese[J]. arXiv preprint arXiv:2211.01335, 2022. (2023-05-23)[2023-06-10]. https://doi.org/10.48550/arXiv.2211.01335.
[7] AFKANPOUR A, ADEEL S, BASSANI H, et al. BERT for long documents: A case study of automated ICD coding[J]. arXiv preprint arXiv:2211.02519, 2022. (2022-11-04)[2023-06-10]. https://doi.org/10.48550/arXiv.2211.02519.
[8] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: Transformers for image recognition at scale[J]. arXiv preprint arXiv:2010.11929, 2020. (2022-06-03)[2023-06-10]. https://doi.org/10.48550/arXiv.2010.11929.
[9] XU Y, ZHANG J, ZHANG Q, et al. ViTPose: Simple vision transformer baselines for human pose estimation[J]. Advances in Neural Information Processing Systems, 2022, 35: 38571-38584.
[10] YUAN Y, FU R, HUANG L, et al. HRFormer: High-resolution transformer for dense prediction[J]. arXiv preprint arXiv:2110.09408, 2021. (2021-11-07)[2023-06-10]. https://doi.org/10.48550/arXiv.2110.09408.
[11] MAO W, GE Y, SHEN C, et al. TFPose: Direct human pose estimation with transformers[J]. arXiv preprint arXiv:2103.15320, 2021. (2021-03-29)[2023-06-10]. https://doi.org/10.48550/arXiv.2103.15320.
[12] 孙琪翔, 何宁, 张敬尊, 等. 基于非局部高分辨率网络的轻量化人体姿态估计方法[J]. 计算机应用, 2022, 42(5): 1398-1406.
SUN Q X, HE N, ZHANG J Z, et al. A lightweight human pose estimation method based on nonlocal high-resolution networks[J]. Computer Applications, 2022, 42(5): 1398-1406. (in Chinese)
[13] ZOPH B, VASUDEVAN V, SHLENS J, et al. Learning
transferable architectures for scalable image
recognition[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018: 8697-8710.
[14] 胡挺, 祝永新, 田犁, 等. 面向移动平台的轻量级卷积神经网络架构[J]. 计算机工程, 2019, 45(1): 17-22.
HU T, ZHU Y X, TIAN L, et al. Lightweight convolutional neural network architecture for mobile platforms[J]. Computer Engineering, 2019, 45(1): 17-22. (in Chinese)
[15] 高坤, 李汪根, 束阳, 等. 融入密集连接的多尺度轻量级人体姿态估计[J]. 计算机工程与应用, 2022, 58(24): 196-204.
GAO K, LI W G, SHU Y, et al. Multi-scale lightweight human pose estimation incorporating dense connectivity[J]. Computer Engineering and Applications, 2022, 58(24): 196-204. (in Chinese)
[16] HINTON G, VINYALS O, DEAN J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015. (2015-03-09)[2023-06-10]. https://doi.org/10.48550/arXiv.1503.02531.
[17] ZAGORUYKO S, KOMODAKIS N. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer[J]. arXiv preprint arXiv:1612.03928, 2016. (2017-02-12)[2023-06-10]. https://doi.org/10.48550/arXiv.1612.03928.
[18] HEO B, KIM J, YUN S, et al. A comprehensive overhaul of
feature distillation[C]//Proceedings of the IEEE/CVF
International Conference on Computer Vision. 2019:
1921-1930.
[19] HAN S, POOL J, TRAN J, et al. Learning both weights and connections for efficient neural network[J]. Advances in Neural Information Processing Systems, 2015, 28.
[20] GUO Y, YAO A, CHEN Y. Dynamic network surgery for efficient DNNs[J]. Advances in Neural Information Processing Systems, 2016, 29.
[21] HUANG Z, WANG N. Data-driven sparse structure selection
for deep neural networks[C]//Proceedings of the European
conference on computer vision (ECCV). 2018: 304-320.
[22] LUO J H, ZHANG H, ZHOU H Y, et al. ThiNet: Pruning CNN filters for a thinner net[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(10): 2525-2538.
[23] HOWARD A G, ZHU M, CHEN B, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017. (2017-04-17)[2023-06-10]. https://doi.org/10.48550/arXiv.1704.04861.
[24] ZHANG X, ZHOU X, LIN M, et al. ShuffleNet: An extremely efficient convolutional neural network for mobile devices[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 6848-6856.
[25] HAN K, WANG Y, TIAN Q, et al. GhostNet: More features from cheap operations[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2020: 1580-1589.
[26] YU C, XIAO B, GAO C, et al. Lite-HRNet: A lightweight high-resolution network[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 10440-10450.
[27] CHENG B W, XIAO B, WANG J D, et al. HigherHRNet: Scale-aware representation learning for bottom-up human pose estimation[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 5385-5394.
[28] GENG Z G, SUN K, XIAO B, et al. Bottom-up human pose estimation via disentangled keypoint regression[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2021: 14671-14681.
[29] 刘圣杰, 何宁, 于海港, 等. 引入坐标注意力和自注意力的人体关键点检测研究[J]. 计算机工程, 2022, 48(12): 86-94.
LIU S J, HE N, YU H G, et al. Research on human keypoint detection incorporating coordinate attention and self-attention[J]. Computer Engineering, 2022, 48(12): 86-94. (in Chinese)
[30] TSOTSOS J K. Analyzing vision at the complexity level[J]. Behavioral and Brain Sciences, 1990, 13(3): 423-445.
[31] TSOTSOS J K. A computational perspective on visual
attention[M]. MIT Press, 2011.
[32] HU J, SHEN L, SUN G. Squeeze-and-excitation
networks[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018: 7132-7141.
[33] WANG Q, WU B, ZHU P, et al. ECA-Net: Efficient channel attention for deep convolutional neural networks[C]//2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2020.
[34] WOO S, PARK J, LEE J Y, et al. CBAM: Convolutional block attention module[C]//Proceedings of the European conference on computer vision. Berlin, Germany: Springer, 2018: 3-19.
[35] CHEN Y, KALANTIDIS Y, LI J, et al. A²-Nets: Double attention networks[J]. Advances in Neural Information Processing Systems, 2018, 31.
[36] WANG X, GIRSHICK R, GUPTA A, et al. Non-local neural networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7794-7803.
[37] CAO Y, XU J, LIN S, et al. GCNet: Non-local networks meet squeeze-excitation networks and beyond[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops. Washington D.C., USA: IEEE Press, 2019.
[38] LIU J J, HOU Q, CHENG M M, et al. Improving convolutional networks with self-calibrated convolutions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 10096-10105.
[39] GAO Z L, XIE J T, WANG Q L, et al. Global second-order pooling convolutional networks[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 3019-3028.
[40] HUANG Z, WANG X, HUANG L, et al. CCNet: Criss-cross attention for semantic segmentation[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2019: 603-612.
[41] CHEN L C, ZHU Y, PAPANDREOU G, et al.
Encoder-decoder with atrous separable convolution for
semantic image segmentation[C]//Proceedings of the
European conference on computer vision (ECCV). 2018:
801-818.