[1] Xu H, Yan R. Research on sports action recognition system based on cluster regression and improved ISA deep network[J]. Journal of Intelligent & Fuzzy Systems, 2020, 39: 5871-5881.
[2] Luo Huilan, Wang Chanjuan, Lu Fei. A survey of video action recognition[J]. Journal on Communications, 2018, 39(06): 169-180.
[3] Luo Huilan, Tong Kang, Kong Fansheng. A survey of deep-learning-based human action recognition in videos[J]. Acta Electronica Sinica, 2019, 47(05): 1162-1173.
[4] Liu X. Sports Deep Learning Method Based on Cognitive
Human Behavior Recognition[J].Computational
Intelligence and Neuroscience, 2022, 2022.
[5] Yao G, Lei T, Zhong J. A review of convolutional-neural-network-based action recognition[J]. Pattern Recognition Letters, 2019, 118: 14-22.
[6] Zhang H B, Zhang Y X, Zhong B, et al. A comprehensive
survey of vision-based human action recognition
methods[J]. Sensors, 2019, 19(5): 1005.
[7] Shi Yuexiang, Zhu Maoqing. Collaborative convolutional Transformer network for skeleton-based action recognition[J]. Journal of Electronics & Information Technology, 2023, 45(04): 1485-1493.
[8] Zhao Junnan, She Qingshan, Meng Ming, Chen Yun. Skeleton-based action recognition with multi-stream spatial attention graph convolutional SRU network[J]. Acta Electronica Sinica, 2022, 50(07): 1579-1585.
[9] Wang Hui, Song Jiahao, Ding Boxu, He Peng, Cao Junjie. Human action recognition based on triangle mesh sequence representation[J]. Journal of Computer-Aided Design & Computer Graphics, 2022, 34(11): 1723-1730.
[10] Wang Hongyan, Yuan Hai. Action recognition method based on fusion of skeleton and apparent features[J]. Journal on Communications, 2022, 43(01): 138-148.
[11] Lin J, Gan C, Han S. TSM: Temporal shift module for efficient video understanding[C]. Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 7083-7093.
[12] Majumder S, Kehtarnavaz N. Vision and inertial sensing
fusion for human action recognition: A review[J]. IEEE
Sensors Journal, 2020, 21(3): 2454-2467.
[13] Wang L, Huynh D Q, Koniusz P. A comparative review of
recent kinect-based action recognition algorithms[J]. IEEE
Transactions on Image Processing, 2019, 29: 15-28.
[14] Tu Z, Zhang J, Li H, et al. Joint-bone fusion graph convolutional network for semi-supervised skeleton action recognition[J]. IEEE Transactions on Multimedia, 2023, 25: 1819-1831.
[15] Li Z, Gavrilyuk K, Gavves E, et al. VideoLSTM convolves, attends and flows for action recognition[J]. Computer Vision and Image Understanding, 2018, 166: 41-50.
[16] Hu Zhengping, Diao Pengcheng, Zhang Ruixue, et al. Research on a 3D multi-branch aggregation lightweight network algorithm for video action recognition[J]. Acta Electronica Sinica, 2020, 48(7): 1261-1268.
[17] Xie Zhao, Zhou Yi, Wu Kewei, Zhang Shunran. Action recognition based on spatio-temporal attention LSTM[J]. Chinese Journal of Computers, 2021, 44(02): 261-274.
[18] Zhang Xiaojun, Li Chenzheng, Sun Lingyu, Zhang Minglu. Action recognition based on an improved 3D convolutional neural network[J]. Computer Integrated Manufacturing Systems, 2019, 25(08): 2000-2006.
[19] Yu Mingli. Research on key technologies of real-time video action classification based on 3D convolutional neural networks[D]. Beijing University of Posts and Telecommunications, 2019.
[20] Shi Xiangbin, Li Yiying, Liu Fang, Dai Qin. T-STAM: an end-to-end action recognition model based on a two-stream spatio-temporal attention mechanism[J]. Application Research of Computers, 2021, 38(04): 1235-1239+1276.
[21] Hara K, Kataoka H, Satoh Y. Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2018: 6546-6555.
[22] Kay W, Carreira J, Simonyan K, et al. The kinetics human
action video dataset[J]. arXiv preprint arXiv:1705.06950,
2017.
[23] Fan Yinhang, Zhao Haifeng, Zhang Shaojie. Human action recognition algorithm based on a 3D convolutional residual network[J]. Application Research of Computers, 2020, 37(S2): 300-301, 304.
[24] Gao Y, Yang F, Yu Q, et al. Three-dimensional porous Cu@Cu2O aerogels for direct voltammetric sensing of glucose[J]. Microchimica Acta, 2019, 186: 1-9.
[25] Woo S, Park J, Lee J Y, et al. CBAM: Convolutional block attention module[C]. Proceedings of the European Conference on Computer Vision (ECCV). 2018: 3-19.
[26] Cai J, Hu J. 3D RANs: 3D residual attention networks for
action recognition[J]. The Visual Computer, 2020, 36:
1261-1270.
[27] Gao Deyong, Kang Zibing, Wang Song, Wang Yangping. Method for human action recognition using a convolutional block attention mechanism[J]. Journal of Xidian University, 2022, 49(04): 144-155+200.
[28] Li Y, Ji B, Shi X, et al. TEA: Temporal excitation and aggregation for action recognition[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 909-918.
[29] Hu J, Shen L, Sun G. Squeeze-and-excitation
networks[C].Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018: 7132-7141.
[30] Soomro K, Zamir A R, Shah M. UCF101: A dataset of 101
human actions classes from videos in the wild[J]. arXiv
preprint arXiv:1212.0402, 2012.
[31] Kuehne H, Jhuang H, Garrote E, et al. HMDB: a large
video database for human motion recognition[C].2011
International conference on computer vision. IEEE, 2011:
2556-2563.
[32] Wang Z, She Q, Smolic A. ACTION-Net: Multipath excitation for action recognition[C]. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 2021: 13209-13218.
[33] Zhuang D, Jiang M, Kong J, et al. Spatiotemporal attention enhanced features fusion network for action recognition[J]. International Journal of Machine Learning and Cybernetics, 2021, 12: 823-841.
[34] Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the kinetics dataset[C]. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 6299-6308.
[35] Zhu J, Zhu Z, Zou W. End-to-end video-level
representation learning for action recognition[C].2018 24th
international conference on pattern recognition (ICPR).
IEEE, 2018: 645-650.
[36] Diba A, Fayyaz M, Sharma V, et al. Temporal 3d convnets:
New architecture and transfer learning for video
classification[J]. arXiv preprint arXiv:1711.08200, 2017.
[37] Wang L, Xiong Y, Wang Z, et al. Temporal segment
networks: Towards good practices for deep action
recognition[C].European conference on computer vision.
Springer, Cham, 2016: 20-36.