Efficient Human Pose Estimation Combining Global Contextual Information

doi:10.19678/j.issn.1000-3428.0065022

Abstract

Existing human pose estimation models typically rely on complex network structures to improve keypoint detection accuracy while disregarding the number of parameters and the model complexity, which makes them difficult to deploy on devices with limited computational resources. To address this problem, a lightweight human pose estimation network that perceives global contextual information, GCEHNet, is constructed. The High-Resolution Network (HRNet) is made lightweight by replacing the standard 3×3 residual convolution module in its structure with a depthwise convolution module, which substantially reduces the number of parameters and the model complexity while preserving network performance. To overcome the limitations of Convolutional Neural Networks (CNNs) in modeling long-range semantic dependencies, a two-branch approach combines the CNN with a Transformer and embeds global position information into the late-stage CNN modules, so that GCEHNet can perceive contextual feature information and network performance improves. A strategy for efficiently fusing CNN features with global position features is designed: it reassigns feature weights by learning from the joint feature information, capturing and enhancing feature information from different receptive fields. Experimental results show that GCEHNet achieves detection accuracies of 71.6% and 71.3% on the MS COCO val2017 and test-dev2017 datasets, respectively. Compared with HRNet, it reduces the number of parameters by 76.4% at a cost of only 4.5% in detection accuracy, achieving a good balance between detection accuracy and model complexity.

Key words: human-machine interaction, human pose estimation, self-attention mechanism, global contextual information, feature fusion
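
The long-range limitation mentioned in the abstract comes from a 3×3 convolution mixing only neighbouring positions, whereas self-attention lets every position weight every other position. The following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. The identity query/key/value projections and the toy `tokens` are assumptions made for illustration; this is not GCEHNet's actual Transformer branch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    tokens: list of equal-length feature vectors, one per spatial position.
    Assumption: queries, keys and values are the tokens themselves
    (identity projections), to keep the sketch short.
    """
    d = len(tokens[0])
    outputs, weights = [], []
    for q in tokens:
        # Similarity of this position to every position, including distant ones.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in tokens]
        w = softmax(scores)
        weights.append(w)
        # Each output is a weighted sum over all positions: global context.
        outputs.append([sum(w[j] * tokens[j][c] for j in range(len(tokens)))
                        for c in range(d)])
    return outputs, weights


# Hypothetical toy input: 4 positions with 2 channels each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
out, attn = self_attention(tokens)
```

Every row of `attn` is strictly positive and sums to 1, so each output position draws on all input positions rather than on a local window.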

Hao LIU, Honglan WU, Yuxuan FANG. Efficient Human Pose Estimation Combining Global Contextual Information[J]. Computer Engineering, 2023, 49(7): 102-109.

https://www.ecice06.com/CN/Y2023/V49/I7/102

Input size (pixels)	Channels	Stage1	Stage2	Stage3	Stage4
64×48	32	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$
32×24	64	-	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$
16×12	128	-	-	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$
8×6	256	-	-	-	$[1\times 1,\ \mathrm{DWConv}\ 3\times 3,\ 1\times 1]\times 2$
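
To see why the [1×1, DWConv 3×3, 1×1] unit in the table above is cheaper than a standard residual unit, the weights of both can be tallied per branch. This is a rough sketch under stated assumptions: the standard unit is taken to be two 3×3 convolutions at constant width, both 1×1 layers are assumed to keep the channel count unchanged (no expansion), and biases and normalisation parameters are ignored. None of these details is given in the table, and the function names are hypothetical.

```python
def standard_block_params(c):
    """Two standard 3x3 convolutions, C -> C each (weights only)."""
    return 2 * (3 * 3 * c * c)


def light_block_params(c):
    """1x1 conv (C -> C), depthwise 3x3 conv, 1x1 conv (C -> C), weights only.

    A depthwise 3x3 convolution uses one 3x3 filter per channel,
    so it contributes 9 * C weights instead of 9 * C * C.
    """
    pointwise = c * c
    depthwise = 3 * 3 * c
    return pointwise + depthwise + pointwise


widths = [32, 64, 128, 256]  # channel widths of the four branches in the table
rows = [(c, standard_block_params(c), light_block_params(c)) for c in widths]

for c, std, light in rows:
    print(f"C={c:3d}  standard={std:8d}  lightweight={light:7d}  ratio={light / std:.3f}")
```

Under these assumptions the ratio is (2C² + 9C) / (18C²) = 1/9 + 1/(2C), so the saving per unit grows with the branch width and approaches a factor of nine.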

Comparison on MS COCO val2017

Model	Pretrained	Input size (pixels)	Params (10⁶)	GFLOPs	AP (%)	AP⁵⁰ (%)	AP⁷⁵ (%)	AP^M (%)	AP^L (%)	AR (%)
SimpleBaseline	Yes	256×192	34.0	8.90	70.4	88.6	78.3	67.1	77.2	76.3
HRNet	No	256×192	28.5	7.10	73.4	89.5	80.7	70.2	80.1	78.9
MobileNetV2	Yes	256×192	9.6	1.48	64.6	87.4	72.3	61.1	71.2	70.7
MobileNetV2	Yes	384×288	9.6	3.33	67.3	87.9	74.3	62.8	74.7	72.9
CPN	Yes	256×192	27.0	6.20	68.6	-	-	-	-	-
Lite-HRNet-18	No	256×192	1.1	0.20	64.8	86.7	73.0	62.1	70.5	71.2
Lite-HRNet-30	No	256×192	1.8	0.31	67.2	88.0	75.0	64.3	73.1	73.3
DAEK	Yes	128×96	63.6	3.60	71.9	89.1	79.6	69.2	78.0	77.9
ShuffleNetV2	Yes	256×192	7.6	1.28	59.9	85.4	66.3	56.6	66.2	66.4
ShuffleNetV2	Yes	384×288	7.6	2.87	63.6	86.5	70.5	59.5	70.7	69.7
DY-MobileNetV2	Yes	256×192	16.1	1.01	68.2	88.4	76.0	65.0	74.7	74.2
DY-ReLU	Yes	256×192	9.0	1.03	68.1	88.5	76.2	64.8	74.3	-
GCEHNet	Yes	256×192	6.7	2.24	70.1	88.5	77.3	67.2	76.3	76.0
GCEHNet	Yes	384×288	6.7	4.64	71.6	89.4	79.6	69.0	78.9	77.9
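
The trade-off quoted in the abstract can be checked against the HRNet and GCEHNet rows at 256×192 in the table above. The short calculation below is only a sanity check on the rounded table values; the helper name `relative_drop` is hypothetical.

```python
def relative_drop(baseline, value):
    """Percentage by which `value` is lower than `baseline`."""
    return (baseline - value) / baseline * 100.0


# Values copied from the HRNet and GCEHNet rows at 256x192.
hrnet = {"params_m": 28.5, "gflops": 7.10, "ap": 73.4}
gcehnet = {"params_m": 6.7, "gflops": 2.24, "ap": 70.1}

param_drop = relative_drop(hrnet["params_m"], gcehnet["params_m"])
flops_drop = relative_drop(hrnet["gflops"], gcehnet["gflops"])
ap_drop = relative_drop(hrnet["ap"], gcehnet["ap"])

print(f"parameters: -{param_drop:.1f}%")
print(f"GFLOPs:     -{flops_drop:.1f}%")
print(f"AP:         -{ap_drop:.1f}% (relative)")
```

With the rounded table values the parameter reduction comes out at about 76.5%, close to the 76.4% stated in the abstract; the small gap is presumably due to rounding of the parameter counts. The 4.5% accuracy loss matches a relative drop in AP (3.3 points out of 73.4), not a drop of 4.5 percentage points.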

Comparison on MS COCO test-dev2017

Model	Input size (pixels)	Params (10⁶)	GFLOPs	AP (%)	AP⁵⁰ (%)	AP⁷⁵ (%)	AP^M (%)	AP^L (%)	AR (%)
SimpleBaseline	384×288	68.6	35.60	73.7	91.9	81.1	70.3	80.0	79.0
HRNet	256×192	28.5	16.00	74.9	92.5	82.8	71.3	80.9	80.1
HRNet	384×288	63.6	32.90	75.5	92.5	83.3	71.9	81.5	80.5
MobileNetV2	384×288	9.8	3.33	66.8	90.0	74.0	62.6	73.3	72.3
CPN	384×288	-	-	72.1	91.4	80.0	68.7	77.2	78.5
Mask-RCNN	-	-	-	63.1	87.3	68.7	57.8	71.4	-
Lite-HRNet-18	384×288	1.1	0.45	66.9	89.4	74.4	64.0	72.2	72.6
Lite-HRNet-30	384×288	1.8	0.70	69.7	90.7	77.5	66.9	75.0	75.4
DAEK	128×96	63.6	32.90	76.2	92.5	83.6	72.5	82.4	81.1
ShuffleNetV2	384×288	7.6	2.87	62.9	88.5	69.4	58.9	69.3	68.9
G-RMI	353×275	42.6	57.00	64.9	85.5	71.3	62.3	70.0	69.7
RMPE	320×256	28.1	26.70	72.3	89.2	79.1	68.0	78.6	-
GCEHNet	384×288	6.7	4.64	71.3	91.0	79.3	68.4	77.6	78.2
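
The AP and AR columns follow the COCO keypoint evaluation, which scores each prediction by object keypoint similarity (OKS): AP⁵⁰ and AP⁷⁵ are AP at OKS thresholds of 0.50 and 0.75, AP averages over thresholds from 0.50 to 0.95, and AP^M and AP^L restrict the evaluation to medium and large objects. A minimal sketch of OKS for one person is given below; the three-keypoint example, its coordinates, and the per-keypoint constants are hypothetical values chosen for illustration, not COCO's 17-keypoint setup.

```python
import math


def oks(pred, gt, visible, area, k):
    """Object keypoint similarity between predicted and ground-truth keypoints.

    pred, gt: lists of (x, y) coordinates; visible: list of bools;
    area: object area in squared pixels; k: per-keypoint constants
    (COCO uses k_i = 2 * sigma_i). Only labelled keypoints are averaged.
    """
    total, count = 0.0, 0
    for (px, py), (gx, gy), v, ki in zip(pred, gt, visible, k):
        if not v:
            continue
        d2 = (px - gx) ** 2 + (py - gy) ** 2
        total += math.exp(-d2 / (2.0 * area * ki ** 2))
        count += 1
    return total / count if count else 0.0


# Hypothetical toy person with three keypoints.
gt = [(10.0, 10.0), (20.0, 10.0), (15.0, 30.0)]
shifted = [(x + 2.0, y) for x, y in gt]  # every keypoint off by 2 pixels
visible = [True, True, True]
area = 400.0
k = [0.05, 0.05, 0.2]  # illustrative constants only

perfect_score = oks(gt, gt, visible, area, k)
shifted_score = oks(shifted, gt, visible, area, k)
```

A perfect prediction scores 1, and the score decays with the squared pixel error scaled by object area, so the same 2-pixel error costs more on a small person than on a large one.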
