
Computer Engineering ›› 2023, Vol. 49 ›› Issue (7): 102-109. doi: 10.19678/j.issn.1000-3428.0065022

• Artificial Intelligence and Pattern Recognition •

Efficient Human Pose Estimation Combining Global Contextual Information

Hao LIU, Honglan WU*, Yuxuan FANG

  1. College of Civil Aviation, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Received: 2022-06-20  Online: 2023-07-15  Published: 2023-07-14
  • Contact: Honglan WU
  • About the authors:

    LIU Hao (born 1995), male, M.S. candidate; his main research interest is object detection

    FANG Yuxuan, undergraduate student

  • Funding:
    Joint Research Fund for Civil Aviation of the National Natural Science Foundation of China and the Civil Aviation Administration of China (U2033202, U1333119)

Abstract:

Existing human pose estimation models typically use complex network structures to improve keypoint detection accuracy while disregarding the number of parameters and the model complexity, which makes them difficult to deploy on computing devices with limited resources. To address this problem, a lightweight human pose estimation network model that perceives global contextual information, called GCEHNet, is constructed. The High-Resolution Network (HRNet) is made lightweight by replacing the standard 3×3 residual convolution modules in its structure with depthwise convolution modules, which substantially reduces the number of parameters and the model complexity while preserving network performance. To overcome the limitations of Convolutional Neural Networks (CNNs) in modeling long-range semantic dependencies, a two-branch approach combines the CNN with a Transformer and embeds global position information into the later CNN modules, enabling GCEHNet to perceive contextual feature information and thereby improving network performance. A strategy for efficiently fusing CNN features with global position features is also designed: feature weights are reassigned by learning from the joint feature information, capturing and enhancing feature information from different receptive fields. Experimental results show that GCEHNet achieves detection accuracies of 71.6% and 71.3% on the MS COCO val2017 and test-dev2017 datasets, respectively. Compared with HRNet, it reduces the number of parameters by 76.4% at the cost of only a 4.5% drop in detection accuracy, achieving a good balance between detection accuracy and model complexity.

Key words: human-machine interaction, human pose estimation, self-attention mechanism, global contextual information, feature fusion
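
For illustration only, the sketch below shows how the three ideas described in the abstract might be realized in code: a depthwise-separable residual block as a lightweight substitute for a standard 3×3 residual convolution, a self-attention branch that supplies global context, and a gate that reweights and fuses the two feature streams. All module names (DepthwiseSeparableResidual, GlobalContextBranch, FusionGate), layer sizes, and the gating design are assumptions made for this sketch; it is not the authors' GCEHNet implementation.

```python
# Minimal PyTorch sketch (illustrative only): the building blocks below are
# assumptions based on the abstract, not the paper's released code.
import torch
import torch.nn as nn


class DepthwiseSeparableResidual(nn.Module):
    """Lightweight residual block: depthwise 3x3 + pointwise 1x1 convolution
    in place of a standard 3x3 residual convolution block."""

    def __init__(self, channels: int):
        super().__init__()
        self.depthwise = nn.Conv2d(channels, channels, kernel_size=3,
                                   padding=1, groups=channels, bias=False)
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.act(self.bn(self.pointwise(self.depthwise(x))))
        return out + x  # residual connection


class GlobalContextBranch(nn.Module):
    """Transformer-style branch: flattens the feature map into tokens and
    applies multi-head self-attention to capture long-range context."""

    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)        # (B, H*W, C)
        ctx, _ = self.attn(tokens, tokens, tokens)   # global self-attention
        ctx = self.norm(ctx + tokens)
        return ctx.transpose(1, 2).reshape(b, c, h, w)


class FusionGate(nn.Module):
    """Fuses local CNN features with global-context features by learning
    per-channel weights from the concatenated (joint) features."""

    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(2 * channels, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, local_feat: torch.Tensor,
                global_feat: torch.Tensor) -> torch.Tensor:
        w = self.gate(torch.cat([local_feat, global_feat], dim=1))  # (B, C, 1, 1)
        return w * local_feat + (1.0 - w) * global_feat


if __name__ == "__main__":
    x = torch.randn(1, 32, 64, 48)                  # dummy feature map
    local_branch = nn.Sequential(DepthwiseSeparableResidual(32),
                                 DepthwiseSeparableResidual(32))
    global_branch = GlobalContextBranch(32)
    fuse = FusionGate(32)
    y = fuse(local_branch(x), global_branch(x))
    print(y.shape)                                  # torch.Size([1, 32, 64, 48])
```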