[1] BEN MABROUK A,ZAGROUBA E.Abnormal behavior recognition for intelligent video surveillance systems:a review[J].Expert Systems with Applications,2018,91(1):480-491.
[2] WANG P C,LI W Q,OGUNBONA P,et al.RGB-D-based human motion recognition with deep learning:a survey[J].Computer Vision and Image Understanding,2018,9(1):1-22.
[3] LO PRESTI L,LA CASCIA M.3D skeleton-based human action classification:a survey[J].Pattern Recognition,2015,53:130-147.
[4] WANG L,HUYNH D Q,KONIUSZ P.A comparative review of recent kinect-based action recognition algorithms[J].IEEE Transactions on Image Processing,2020,29(3):15-28.
[5] SAIF S,TEHSEEN S,KAUSAR S.A survey of the techniques for the identification and classification of human actions from visual data[J].Sensors,2018,18(11):39-49.
[6] MONTES A,SALVADOR A,PASCUAL S,et al.Temporal activity detection in untrimmed videos with recurrent neural networks[EB/OL].[2020-08-10].https://arxiv.org/pdf/1608.08128.pdf.
[7] 张瑞,李其申,储珺.基于3D卷积神经网络的人体动作识别算法[J].计算机工程,2019,45(1):259-263. ZHANG R,LI Q S,CHU J.Human action recognition algorithm based on 3D convolution neural network[J].Computer Engineering,2019,45(1):259-263.(in Chinese)
[8] SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[EB/OL].[2020-08-10].https://arxiv.org/abs/1406.2199.
[9] TRAN D,BOURDEV L,FERGUS R,et al.Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2015:4489-4497.
[10] ZHANG B,WANG L,WANG Z,et al.Real-time action recognition with deeply transferred motion vector CNNs[J].IEEE Transactions on Image Processing,2018,27(5):2326-2339.
[11] FEICHTENHOFER C,PINZ A,ZISSERMAN A.Convolutional two-stream network fusion for video action recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:1933-1941.
[12] WANG L M,XIONG Y J,WANG Z,et al.Temporal segment networks:towards good practices for deep action recognition[C]//Proceedings of the 14th European Conference on Computer Vision.Berlin,Germany:Springer,2016:20-36.
[13] LAN Z,ZHU Y,HAUPTMANN A G.Deep local video feature for action recognition[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2017:1219-1225.
[14] ZHOU B,ANDONIAN A,TORRALBA A.Temporal relational reasoning in videos[EB/OL].[2020-08-10].https://arxiv.org/pdf/1711.08496.pdf.
[15] WANG X,HU J F,LAI J H,et al.Progressive teacher-student learning for early action prediction[C]//Proceedings of 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2019:235-247.
[16] JI S,YANG M,YU K.3D convolutional neural networks for human action recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2013,35(1):221-231.
[17] CARREIRA J,ZISSERMAN A.Quo vadis,action recognition? A new model and the kinetics dataset[C]//Proceedings of 2017 IEEE Conference on Computer Vision and Pattern Recognition.Honolulu,USA:IEEE Press,2017:4724-4733.
[18] QIU Z,YAO T,MEI T.Learning spatio-temporal representation with pseudo-3D residual networks[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2017:367-381.
[19] TRAN D,WANG H,TORRESANI L,et al.A closer look at spatiotemporal convolutions for action recognition[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:365-378.
[20] WANG X,GIRSHICK R,GUPTA A,et al.Non-local neural networks[C]//Proceedings of 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:443-456.
[21] ZHU J,ZOU W,ZHU Z.End-to-end video-level representation learning for action recognition[C]//Proceedings of the 24th International Conference on Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:234-245.
[22] CHAUDHARI S,POLATKAN G,RAMANATH R,et al.An attentive survey of attention models[EB/OL].[2020-08-10].https://arxiv.org/abs/1904.02874v3.
[23] ZHANG J,XIE Y,XIA Y,et al.Attention residual learning for skin lesion classification[J].IEEE Transactions on Medical Imaging,2019,38(9):2092-2103.
[24] JADERBERG M,SIMONYAN K,ZISSERMAN A.Spatial transformer networks[C]//Advances in Neural Information Processing Systems.Cambridge,USA:MIT Press,2015:2017-2025.
[25] HU J,SHEN L,ALBANIE S,et al.Squeeze-and-excitation networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2020,42(8):2011-2023.
[26] WANG F,JIANG M,QIAN C,et al.Residual attention network for image classification[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2017:189-197.
[27] TU Z G,LI H Y,ZHANG D J,et al.Action-stage emphasized spatio-temporal VLAD for video action recognition[J].IEEE Transactions on Image Processing,2019,28(6):2799-2812.
[28] WOO S,PARK J,LEE J Y,et al.CBAM:convolutional block attention module[C]//Proceedings of the 15th European Conference on Computer Vision.Berlin,Germany:Springer,2018:3-19.
[29] SOOMRO K,ZAMIR A R,SHAH M.UCF101:a dataset of 101 human actions classes from videos in the wild[EB/OL].[2020-08-10].https://arxiv.org/pdf/1212.0402.pdf.
[30] KUEHNE H,JHUANG H,STIEFELHAGEN R,et al.HMDB:a large video database for human motion recognition[C]//Proceedings of International Conference on High Performance Computing in Science and Engineering.Berlin,Germany:Springer,2013:571-582.
[31] DENG J,DONG W,SOCHER R,et al.ImageNet:a large-scale hierarchical image database[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Computer Society,2009:248-255.
[32] PÉREZ J S,MEINHARDT-LLOPIS E,FACCIOLO G.TV-L1 optical flow estimation[J].Image Processing on Line,2013(3):137-150.
[33] ZINKEVICH M,WEIMER M,SMOLA A J,et al.Parallelized stochastic gradient descent[C]//Advances in Neural Information Processing Systems.Cambridge,USA:MIT Press,2010:485-496.
[34] DIBA A,FAYYAZ M,SHARMA V,et al.Temporal 3D ConvNets:new architecture and transfer learning for video classification[EB/OL].[2020-08-10].https://arxiv.org/pdf/1711.08200.pdf.