Human Action Recognition Method Based on Action-Time Perception

doi:10.19678/j.issn.1000-3428.0068398

Abstract

To address the redundant information in action videos and the sparse distribution of action information across feature channels, a 3D residual network based on action-time perception is proposed. The Action-perception Module (AM) computes feature-level temporal differences and uses them to excite motion-sensitive channels, thereby obtaining motion features. The Temporal attention Module (TM) computes an attention weight matrix along the time dimension to obtain local temporal features. The outputs of the AM and TM are added to give a fused feature of the action information, which is then incorporated into a 3D residual network to construct a 3D residual network based on the Action-Time perception Module (ATM). Experimental results show that, on the public datasets UCF101 and HMDB51, the ATM-based 3DResNeXt-101 network improves action recognition accuracy by 1.6% and 2.8%, respectively, over the 3DResNeXt-101 network, indicating that the proposed method is feasible and effective.

Key words: deep learning, action recognition, action perception, temporal attention, 3D residual network
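
The data flow described in the abstract (temporal differences that gate channels, attention weights over time, and an additive fusion of the two branches) can be illustrated with a small sketch. This is not the authors' implementation: it assumes the spatial dimensions have already been pooled away so that each frame is a list of channel values, it uses no learned convolutions, and the function names `motion_gate`, `temporal_weights`, and `atm_fuse` are hypothetical. The gate `sigmoid(|difference|)` and the per-frame softmax score are illustrative choices only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def motion_gate(feats):
    """Channel gates from feature-level temporal differences (assumed form).

    feats: T frames, each a list of C spatially pooled channel values.
    Returns a T x C list of gates in [0.5, 1); static channels get 0.5.
    """
    T, C = len(feats), len(feats[0])
    gates = []
    for t in range(T):
        # The last frame has no successor, so its difference is taken as zero.
        nxt = feats[t + 1] if t + 1 < T else feats[t]
        gates.append([sigmoid(abs(nxt[c] - feats[t][c])) for c in range(C)])
    return gates


def temporal_weights(feats):
    """Softmax attention weights along the time axis, one weight per frame."""
    scores = [sum(frame) / len(frame) for frame in feats]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def atm_fuse(feats):
    """Add the motion-gated branch and the time-weighted branch element-wise."""
    gates = motion_gate(feats)
    weights = temporal_weights(feats)
    fused = []
    for t, frame in enumerate(feats):
        am = [x * g for x, g in zip(frame, gates[t])]  # motion branch
        tm = [x * weights[t] for x in frame]           # temporal branch
        fused.append([a + b for a, b in zip(am, tm)])
    return fused


# Toy input: 3 frames, 2 channels. Channel 0 is static, channel 1 changes once.
feats = [[1.0, 0.0], [1.0, 2.0], [1.0, 2.0]]
fused = atm_fuse(feats)
```

In this toy input the static channel receives the minimum gate at every frame, while the channel that changes between the first two frames receives a larger gate at the first frame, which is the sense in which temporal differences can highlight motion-sensitive channels.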

WANG Xiaolu, WEN Jianrong. Human Action Recognition Method Based on Action-Time Perception[J]. Computer Engineering, 2025, 51(1): 216-224.

https://www.ecice06.com/CN/Y2025/V51/I1/216

Figures/Tables: 10

Fig.1 3D residual network based on action-time perception

Fig.2 Action perception module

Fig.3 Temporal attention module

Fig.4 Two connection methods

Fig.5 Accuracy of the ATM-based 3DResNet-50 on the two datasets

Fig.6 Heat maps of each network on the two datasets


