Fusing Global-Local Contextual Information for Small Object Multi-Person Pose Estimation

doi:10.19678/j.issn.1000-3428.0067715

Abstract

Although multi-person 2D pose estimation methods are maturing, existing methods still cannot effectively identify the poses of small objects. To address this problem, a multi-person pose estimation method that fuses global and local contextual information is proposed. The method uses the multi-scale features output by a High-Resolution Network (HRNet) to coarsely locate multiple anatomical centers of the human body; the multiple center points provide more supervisory information for small objects and improve the ability to localize them. Using the located human center-point coordinates as cues, local contextual information at different scales near each center point is extracted through deformable sampling, and a contrastive loss between the local contextual information of different objects is calculated to improve discrimination between objects. With the low-resolution features of HRNet serving as global contextual information and the local contextual information serving as cross-attention queries, a multilayer Transformer model is constructed that combines the two to enhance the contextual information of small objects. The enhanced contextual information is then used as clustering centers to decouple the multi-scale fused features into keypoint heatmaps for each object, thereby achieving multi-person pose estimation for small objects. Experimental results show that the proposed method effectively improves pose recognition for small objects, achieving an Average Precision (AP) of 69.0% on the COCO test-dev2017 dataset and improving AP^M by 1.4 percentage points over the Dual Anatomical Centers (DAC) method.

Key words: pose estimation, small object, multiple center points, attention, contextual information
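
The abstract names two mechanisms without giving formulas: a contrastive loss between the local contextual information of different persons, and cross-attention in which local context acts as the query over global context. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes single-head scaled dot-product attention with no learned projections and an InfoNCE-style loss in which each person's feature is its own positive. The function names (`cross_attention`, `contrastive_loss`), the temperature value, and the toy vectors are hypothetical, and plain Python lists stand in for feature tensors so that the sketch runs without dependencies.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def cross_attention(local_queries, global_memory):
    """Enhance each per-person local context with global context.

    local_queries: one feature vector per detected person (the queries).
    global_memory: flattened low-resolution feature vectors, used here
                   as both keys and values (assumption: no projections).
    """
    scale = 1.0 / math.sqrt(len(global_memory[0]))
    enhanced = []
    for q in local_queries:
        weights = softmax([dot(q, k) * scale for k in global_memory])
        attended = [
            sum(w * v[d] for w, v in zip(weights, global_memory))
            for d in range(len(q))
        ]
        # Residual connection, as in a standard Transformer decoder layer.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
    return enhanced


def contrastive_loss(local_contexts, temperature=0.1):
    """InfoNCE-style loss that pushes different persons' features apart.

    Each normalized feature is its own positive; the features of all
    other persons in the image act as negatives (assumed formulation).
    """
    normed = [l2_normalize(f) for f in local_contexts]
    total = 0.0
    for i, fi in enumerate(normed):
        probs = softmax([dot(fi, fj) / temperature for fj in normed])
        total += -math.log(probs[i])
    return total / len(normed)


# Toy usage: two persons, three global feature locations, four channels.
persons = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
memory = [[0.5, 0.5, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0], [1.0, 0.0, 0.0, 1.0]]
print(cross_attention(persons, memory))
print(contrastive_loss(persons))
```

In this formulation the loss is smallest when the features of different persons are dissimilar, which is the sense in which the abstract says the loss improves discrimination between objects. A multilayer model would stack several such attention steps with learned projections and feed-forward layers.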

Chenzhi LONG, Ping CHEN, Chuankun LI. Fusing Global-Local Contextual Information for Small Object Multi-Person Pose Estimation[J]. Computer Engineering, 2024, 50(4): 342-349.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067715

http://www.ecice06.com/EN/Y2024/V50/I4/342

Figures/Tables 13

Fig.1 Overall architecture of the model

Fig.2 Center point partition strategies

Fig.3 Structure of contextual information extraction

Fig.4 Implementation process of deformable sampling

Fig.5 Structure of the Transformer decoder

Fig.6 Inference time for different numbers of persons

Fig.7 Visualization results of the proposed method on the COCO test-dev2017 dataset
