
计算机工程 (Computer Engineering), 2023, Vol. 49, Issue (7): 223-231. doi: 10.19678/j.issn.1000-3428.0065330

• Graphics and Image Processing •

Human Pose Estimation Method Based on Cross Attention Transformer

Kuan WANG, Shibin XUAN*, Xuedong HE, Ziwei LI, Jiaxiang LI

  1. College of Artificial Intelligence, Guangxi Minzu University, Nanning 530006, China
  • Received: 2022-07-22  Online: 2023-07-15  Published: 2023-07-14
  • Corresponding author: Shibin XUAN
  • About the authors:

    Kuan WANG (born 1995), male, M.S. candidate; his main research interest is pose estimation

    Xuedong HE, M.S. candidate

    Ziwei LI, M.S. candidate

    Jiaxiang LI, M.S. candidate

  • Funding:
    National Natural Science Foundation of China (61866003)



Abstract:

Most existing deep convolutional network methods for human pose estimation stack Transformer encoders without fully considering low-resolution global semantic information, which makes the model hard to train and raises inference costs. Hence, a multiscale representation learning method based on the cross-attention Transformer is proposed. First, a deep convolutional network is used to obtain feature maps at different resolutions. These feature maps are then transformed into multiscale visual tokens, and the distribution of keypoints in the token space is predicted, which improves the convergence speed of the model. To improve the identifiability of low-resolution global semantics, a multiscale cross-attention module is proposed; it reduces the redundancy of keypoint tokens and the number of cross-fusion operations through multiple interactions between feature tokens of different resolutions and a keypoint-shifting strategy. Finally, a cross-attention fusion module extracts feature information at different scales from the feature tokens to form keypoint tokens, which helps reduce the inaccuracy of upsampling fusion. Experimental results on multiple benchmark datasets show that the method effectively helps the Transformer encoder learn the correlations between keypoints and, compared with the state-of-the-art TokenPose, reduces the computational cost by 11.8% without degrading performance.

Key words: global semantics, multiscale cross-attention, human pose estimation, representation learning, cross-attention fusion, Transformer encoder
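
To make the described fusion step concrete, the following is a minimal PyTorch sketch of one plausible reading of the cross-attention fusion module: learnable keypoint tokens act as queries that attend to visual tokens flattened from feature maps of several resolutions. All class names, dimensions, and the single-layer design are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn


class CrossAttentionFusion(nn.Module):
    """Keypoint tokens cross-attend to multiscale visual tokens (illustrative sketch)."""

    def __init__(self, dim=192, num_keypoints=17, num_heads=8):
        super().__init__()
        # One learnable token per joint (17 joints in COCO); zero init for brevity.
        self.keypoint_tokens = nn.Parameter(torch.zeros(1, num_keypoints, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    @staticmethod
    def tokenize(feature_map):
        # (B, C, H, W) -> (B, H*W, C): one visual token per spatial location.
        return feature_map.flatten(2).transpose(1, 2)

    def forward(self, feature_maps):
        # Concatenate tokens from every resolution into one key/value sequence,
        # so low-resolution global semantics sit beside high-resolution detail.
        tokens = self.norm_kv(torch.cat([self.tokenize(f) for f in feature_maps], dim=1))
        queries = self.norm_q(self.keypoint_tokens.expand(tokens.size(0), -1, -1))
        # Cross-attention: each keypoint query gathers scale-specific evidence.
        fused, _ = self.attn(queries, tokens, tokens)
        return fused  # (B, num_keypoints, dim), ready for a Transformer encoder


# Toy usage with two CNN feature maps that share the channel dimension.
high = torch.randn(2, 192, 64, 48)  # higher-resolution features
low = torch.randn(2, 192, 16, 12)   # low-resolution global semantics
out = CrossAttentionFusion()([high, low])
print(out.shape)  # torch.Size([2, 17, 192])

Keeping the low-resolution tokens in the same key/value sequence lets each keypoint query see global context without a separate upsampling step, which matches the abstract's stated motivation for cross-scale fusion.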