
Computer Engineering ›› 2025, Vol. 51 ›› Issue (11): 283-293. doi: 10.19678/j.issn.1000-3428.0069680

• Graphics and Image Processing •

  • Supported by:
    Shanxi Province Science and Technology Major Special Plan "Open Competition" Project (202201150401021); Shanxi Province Special Guide Project for the Transformation of Scientific and Technological Achievements (202104021301055); Natural Science Foundation of Shanxi Province (202303021211153); Natural Science Foundation of Shanxi Province (202203021222027); Shanxi Province Graduate Practice and Innovation Project (2023SJ215)

High-Precision and Lightweight Skeleton Behavior Recognition Based on Spatial-Temporal Feature Fusion

DING Shuai1,2,3, KUANG Liqun1,2,3,*, CAO Yaming1,2,3, HAN Huiyan1,2,3, XIONG Fengguang1,2,3

  1. School of Computer Science and Technology, North University of China, Taiyuan 030051, Shanxi, China
    2. Shanxi Key Laboratory of Machine Vision and Virtual Reality, Taiyuan 030051, Shanxi, China
    3. Shanxi Province's Vision Information Processing and Intelligent Robot Engineering Research Center, Taiyuan 030051, Shanxi, China
  • Received: 2024-04-01  Revised: 2024-05-21  Online: 2025-11-15  Published: 2024-09-02
  • Contact: KUANG Liqun


Abstract:

Traditional methods for human behavior recognition based on RGB videos face numerous challenges when dealing with complex backgrounds, lighting effects, and variations in appearance. By contrast, methods that leverage human skeletal information for behavior recognition are less affected by these factors. However, current mainstream skeleton-based behavior recognition methods struggle to balance accuracy and complexity. To maintain high recognition accuracy while addressing the large parameter counts and high computational complexity of existing models, a lightweight network structure comprising three novel encoding blocks is proposed. First, efficient multiscale attention modules are incorporated into the self-attention graph convolutional module for spatial modeling and the multiscale temporal convolutional module for temporal modeling, enhancing the model's ability to recognize and exploit temporal and spatial feature information and thereby enriching the skeletal data features. Second, a multifeature fusion adaptive module is employed to strengthen feature fusion and generalization capabilities. Finally, an iterative feature fusion enhancement module is used to further improve the modeling of complex feature relationships. Experimental results demonstrate that, on the large-scale NTU-RGB+D60 dataset, the proposed method achieves accuracies of 91.1% and 95.4% under the Cross-Subject (CS) and Cross-View (CV) evaluations, respectively. On the NTU-RGB+D120 dataset, it attains accuracies of 87.3% and 88.8% under the CS and Cross-Setup (SS) evaluations, respectively, with a parameter count of 0.72×10⁶ and a floating-point operation count of 0.6×10⁹. Comparative experiments indicate that the proposed algorithm outperforms several recent mainstream algorithms in terms of parameter count, floating-point operations, and recognition accuracy, effectively balancing these metrics and providing a lightweight network model for accurate human behavior recognition.
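The spatial-modeling step the abstract refers to, graph convolution over the skeleton's joint graph, can be sketched as follows. This is a generic, minimal illustration of skeleton graph convolution in NumPy, not the paper's actual module: the joint count, edge list, symmetric normalization, and weight shapes are all illustrative assumptions.

```python
import numpy as np

def normalized_adjacency(edges, num_joints):
    """Symmetrically normalized adjacency with self-loops for a skeleton
    graph: D^(-1/2) (A + I) D^(-1/2), a common GCN normalization."""
    A = np.eye(num_joints)
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    d_inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return A * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def spatial_graph_conv(x, A_norm, W):
    """One spatial graph-convolution step: aggregate each joint's neighbors
    through A_norm, then mix channels with weight matrix W.
    x: (T, V, C_in) -- V joints tracked over T frames."""
    return np.einsum("uv,tvc,cd->tud", A_norm, x, W)

# Toy 5-joint chain (e.g., one arm), 2 frames, 3 input channels (x, y, z).
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
A_norm = normalized_adjacency(edges, num_joints=5)
x = np.random.default_rng(0).standard_normal((2, 5, 3))
W = np.random.default_rng(1).standard_normal((3, 8))  # lift 3 -> 8 channels
features = spatial_graph_conv(x, A_norm, W)           # shape (2, 5, 8)
```

In architectures of the kind described, such a spatial step would alternate with temporal convolutions along the frame axis, with attention modules reweighting the resulting features; those parts are omitted here.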

Key words: human skeleton, behavior recognition, lightweight, graph convolution, feature fusion