
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 342-349. doi: 10.19678/j.issn.1000-3428.0067715

• Development Research and Engineering Application •

Fusing Global-Local Contextual Information for Small Object Multi-Person Pose Estimation

Chenzhi LONG, Ping CHEN*, Chuankun LI

  1. Shanxi Key Laboratory of Signal Capturing and Processing, North University of China, Taiyuan 030051, Shanxi, China
  • Received: 2023-05-29 Online: 2024-04-15 Published: 2023-08-17
  • Contact: Ping CHEN

融合全局-局部上下文信息的小目标多人姿态估计

龙辰志, 陈平*, 李传坤

  1. 中北大学信息探测与处理山西省重点实验室, 山西 太原 030051
  • 通讯作者: 陈平
  • Supported by:
    National Natural Science Foundation of China (62101512); Shanxi Province Youth Science Foundation (20210302124031)

Abstract:

Existing multi-person 2D pose estimation methods, despite recent advances, cannot effectively identify the poses of small objects. To address this problem, a multi-person pose estimation method that fuses global and local contextual information is proposed. The method uses the multi-scale features output by a High-Resolution Network (HRNet) to coarsely locate multiple anatomical centers of the human body; the multiple center points provide more supervisory information for small objects and improve the ability to localize them. Using the located human center-point coordinates as cues, local contextual information at different scales is extracted near each center point through deformable sampling, and a contrastive loss between the local contextual information of different objects is computed to improve discrimination between objects. With the low-resolution features of HRNet as global contextual information and the local contextual information as cross-attention queries, a multilayer Transformer model is constructed that combines global and local contextual information to enhance the contextual information of small objects. The enhanced contextual information then serves as clustering centers for decoupling the multi-scale fused features into keypoint heatmaps for the individual objects, achieving multi-person pose estimation of small objects. Experimental results show that the proposed method effectively improves pose recognition for small objects, achieving an Average Precision (AP) of 69.0% on the COCO test-dev2017 dataset and an APM that is 1.4 percentage points higher than that of the Dual Anatomical Centers (DAC) method.
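The two central steps of the abstract, enhancing per-person local context with global context through cross-attention and then using the enhanced vectors to decouple per-person heatmaps, can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the 2-D toy features, the absence of learned projections, the residual connection, and the sigmoid dot-product scoring are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, context):
    """Each query (one person's local context) attends over all global
    context tokens. Simplification (assumption): keys and values are the
    context tokens themselves, with no learned projection matrices."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    enhanced = []
    for q in queries:
        weights = softmax([dot(q, k) * scale for k in context])
        mixed = [sum(w * v[d] for w, v in zip(weights, context))
                 for d in range(len(q))]
        # Residual connection (assumption) keeps the original local cue.
        enhanced.append([qi + mi for qi, mi in zip(q, mixed)])
    return enhanced


def decouple(centers, feature_map):
    """Score every pixel feature against each person's enhanced vector,
    giving one response map per person (sigmoid of the dot product)."""
    return [
        [[1.0 / (1.0 + math.exp(-dot(c, pix))) for pix in row]
         for row in feature_map]
        for c in centers
    ]


# Toy data (hypothetical): two people, 2-D features.
local_ctx = [[1.0, 0.0], [0.0, 1.0]]                # cross-attention queries
global_ctx = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]   # low-resolution tokens
features = [[[3.0, -3.0], [-3.0, 3.0]]]             # 1x2 map of pixel features

centers = cross_attention(local_ctx, global_ctx)
heatmaps = decouple(centers, features)
```

In this toy setup, each person's map responds strongly only at the pixel whose feature aligns with that person's enhanced context vector, which is the intuition behind using the vectors as clustering centers. A real model would use learned query, key, and value projections, multiple attention heads and layers, and one heatmap per keypoint type.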

Key words: pose estimation, small object, multiple center points, attention, contextual information

摘要:

尽管多人2D姿态估计方法趋近成熟, 但是现有方法无法有效识别小目标的姿态。针对当前小目标姿态难以识别的问题, 提出一种融合全局-局部上下文信息的多人姿态估计方法。利用高分辨率网络(HRNet)输出的不同尺度特征对人体的多个解剖中心进行粗糙的定位, 通过多个中心点给小目标提供更多的监督信息, 提高对小目标的定位能力。以定位的人体中心点坐标为线索, 通过可变形采样的方式提取中心点附近不同尺度的局部上下文信息, 并计算不同目标局部上下文信息之间的对比损失以提高目标之间的判别能力。以HRNet网络的低分辨率特征作为全局上下文信息, 以局部上下文信息作为交叉注意力的查询, 结合全局和局部上下文信息构建多层Transformer模型, 增强小目标的上下文信息。将增强的小目标上下文信息作为聚类中心, 解耦多尺度融合的特征得到不同目标对应的关键点热图, 从而实现小目标多人姿态估计。实验结果表明, 该方法能够有效提高小目标姿态的识别性能, 在COCO test-dev2017数据集上取得了69.0%的平均精度(AP), APM比对偶解剖中心(DAC)方法提高1.4个百分点。

关键词: 姿态估计, 小目标, 多中心点, 注意力, 上下文信息