
Computer Engineering ›› 2024, Vol. 50 ›› Issue (3): 317-325. doi: 10.19678/j.issn.1000-3428.0067134

• Development Research and Engineering Application •

Implicit Modeling Network of Human Keypoints Based on Attention Mechanism

Jiayuan ZHAO1,2,*, Yuru ZHANG1,2, Xiaodong SU1,2, Hongyan XU1,2, Shizhou LI1,2, Yurong ZHANG1,2

  1. School of Computer and Information Engineering, Harbin University of Commerce, Harbin 150028, Heilongjiang, China
  2. Heilongjiang Key Laboratory of Electronic Commerce and Intelligent Information Processing, Harbin 150028, Heilongjiang, China
  • Received: 2023-03-09 Online: 2024-03-15 Published: 2024-03-18
  • Contact: Jiayuan ZHAO
  • Supported by:
    Natural Science Foundation of Heilongjiang Province (LH2022F035); 2022 Harbin University of Commerce Teachers' "Innovation" Project Support Program (XL0068); Harbin University of Commerce Graduate Innovation Research Project (YJSCX2022-743HSD)

Abstract:

Human pose estimation relies on visual cues and on the anatomical relationships between joints to locate keypoints, but methods based on Convolutional Neural Networks (CNNs) struggle to attend to long-range contextual cues and to model dependencies between distant joints. This paper proposes an attention-based implicit modeling method that computes the feature correlations between joints iteratively over multiple stages, thereby implicitly modeling the constraint relationships among keypoints. This removes the restriction to the local operations of CNNs, enlarges the network's receptive field, and captures dependencies between distant joints. To address the tendency of the network to neglect invisible keypoints during training, a focal loss function is adopted so that the network pays more attention to hard keypoints. Experiments use the High-Resolution Network (HRNet), the most accurate feature extraction network at the time of writing, and the classic Residual Network (ResNet) as backbones. The results show that, under identical experimental conditions, the implicit modeling method improves the performance of human pose estimation networks: with HRNet as the backbone, accuracy on the MPII and MSCOCO human pose estimation benchmark datasets improves by 1.7% and 2.6%, respectively, over the original network.

Key words: human pose estimation, convolutional neural network, attention mechanism, focal loss, implicit modeling
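
The abstract names two ingredients: multi-stage attention over joint features, and a focal loss that emphasizes hard keypoints. The following is a minimal sketch of both ideas, not the authors' implementation. It assumes one feature vector per joint, uses the raw features as queries, keys, and values (no learned projections), and uses the standard binary focal loss form; the abstract does not specify the variant used in the paper. All function names (`joint_attention`, `refine`, `focal_loss`) and default parameters are hypothetical, and plain Python lists are used to keep the example dependency-free.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def joint_attention(features):
    """One attention stage over per-joint feature vectors.

    features: list of K vectors (one per joint), each of length D.
    Returns (refined_features, attention_weights), where
    attention_weights[i][j] is how strongly joint i attends to joint j.
    Assumption: queries, keys, and values are the features themselves.
    """
    dim = len(features[0])
    scale = math.sqrt(dim)
    weights = []
    for query in features:
        # Scaled dot-product similarity between this joint and every joint.
        scores = [sum(a * b for a, b in zip(query, key)) / scale
                  for key in features]
        weights.append(softmax(scores))
    refined = []
    for i, row in enumerate(weights):
        # Weighted mix of all joints' features, regardless of distance.
        mixed = [sum(w * value[d] for w, value in zip(row, features))
                 for d in range(dim)]
        # Residual connection keeps the joint's own evidence.
        refined.append([x + m for x, m in zip(features[i], mixed)])
    return refined, weights


def refine(features, stages=3):
    """Apply the attention stage repeatedly (multi-stage iteration)."""
    weights = None
    for _ in range(stages):
        features, weights = joint_attention(features)
    return features, weights


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    """Mean binary focal loss over predicted probabilities.

    probs: predicted probabilities in [0, 1]; targets: 0/1 labels.
    Well-classified entries are down-weighted by (1 - p_t) ** gamma,
    so hard entries dominate the loss.
    """
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        p_t = p if t == 1 else 1.0 - p
        alpha_t = alpha if t == 1 else 1.0 - alpha
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # Three toy joints with 2-dimensional features.
    joints = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    refined, attn = refine(joints, stages=2)
    print("attention rows:", attn)
    print("easy vs hard keypoint loss:",
          focal_loss([0.9], [1]), focal_loss([0.1], [1]))
```

Each row of the attention matrix is a distribution over all joints, so a wrist can draw on evidence from a hip as easily as from an elbow, which is the sense in which attention is not confined to a local neighborhood. In the focal loss, setting `gamma=0` recovers an alpha-weighted cross-entropy, which makes the effect of the focusing term easy to compare.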