
Computer Engineering ›› 2025, Vol. 51 ›› Issue (11): 283-293. doi: 10.19678/j.issn.1000-3428.0069680

• Graphics and Image Processing •

  • Supported by:
    Shanxi Province Science and Technology Major Special Plan "Open Competition" Project (202201150401021); Shanxi Province Special Guide Project for the Transformation of Scientific and Technological Achievements (202104021301055); Natural Science Foundation of Shanxi Province (202303021211153); Natural Science Foundation of Shanxi Province (202203021222027); Shanxi Province Graduate Practice and Innovation Project (2023SJ215)

High-Precision and Lightweight Skeleton Behavior Recognition Based on Spatial-Temporal Feature Fusion

DING Shuai1,2,3, KUANG Liqun1,2,3,*, CAO Yaming1,2,3, HAN Huiyan1,2,3, XIONG Fengguang1,2,3

  1. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
    2. Shanxi Key Laboratory of Machine Vision and Virtual Reality, Taiyuan 030051, Shanxi, China
    3. Shanxi Province's Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan 030051, Shanxi, China
  • Received: 2024-04-01  Revised: 2024-05-21  Online: 2025-11-15  Published: 2024-09-02
  • Contact: KUANG Liqun


Abstract:

Traditional methods for human behavior recognition based on RGB videos face numerous challenges when dealing with complex backgrounds, lighting effects, and variations in appearance. By contrast, methods that leverage human skeletal information for behavior recognition are less affected by these factors. However, current mainstream skeleton-based behavior recognition methods struggle to balance accuracy and complexity. To maintain high recognition accuracy while addressing the large parameter counts and high computational complexity of existing models, a lightweight network structure comprising three novel encoding blocks is proposed. First, efficient multiscale attention modules are incorporated into the self-attention graph convolutional module for spatial modeling and the multiscale temporal convolutional module for temporal modeling, enhancing the model's ability to recognize and exploit temporal and spatial feature information and thereby enriching the skeletal data features. Second, a multifeature fusion adaptive module is employed to strengthen feature fusion and generalization capabilities. Finally, an iterative feature fusion enhancement module is used to further improve the modeling of complex feature relationships. Experimental results demonstrate that, on the large-scale NTU-RGB+D60 dataset, the proposed method achieves accuracies of 91.1% and 95.4% under the Cross-Subject (CS) and Cross-View (CV) evaluations, respectively. On the NTU-RGB+D120 dataset, it attains accuracies of 87.3% and 88.8% under the CS and Cross-Setup (SS) evaluations, respectively, with a parameter count of 0.72×10⁶ and a floating-point operation count of 0.6×10⁹. Comparative experiments indicate that the proposed algorithm outperforms several recent mainstream algorithms in terms of parameter count, floating-point operations, and recognition accuracy, effectively balancing these metrics and providing a lightweight network model for accurate human behavior recognition.
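The spatial-modeling step the abstract refers to, graph convolution over the skeleton's joint graph, can be sketched as follows. This is a generic, minimal illustration of skeleton graph convolution in NumPy, not the paper's actual module: the joint count, edge list, symmetric normalization, and weight shapes are all illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops for a skeleton
    graph: D^(-1/2) (A + I) D^(-1/2), a common GCN normalization."""
    A = np.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spatial_graph_conv(x, A_norm, W):
    """One spatial graph-convolution step: aggregate each joint's neighbors
    through A_norm, then mix channels with weight matrix W.
    x: (T, V, C_in) -- V joints tracked over T frames."""
    return np.einsum("uv,tvc,cd->tud", A_norm, x, W)

# Toy 5-joint chain (e.g., one arm), 2 frames, 3 input channels (x, y, z).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_norm = normalized_adjacency(edges, num_joints=5)
x = np.random.default_rng(0).standard_normal((2, 5, 3))
W = np.random.default_rng(1).standard_normal((3, 8))  # lift 3 -> 8 channels
features = spatial_graph_conv(x, A_norm, W)           # shape (2, 5, 8)
```

In architectures of the kind described, such a spatial step would alternate with temporal convolutions along the frame axis, with attention modules reweighting the resulting features; those parts are omitted here.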

Key words: human skeleton, behavior recognition, lightweight, graph convolution, feature fusion