Research on Human Action Recognition Method by Fusing Multimodal Data

doi:10.19678/j.issn.1000-3428.0064490

Abstract

Human action recognition based on multimodal fusion has been widely investigated. Existing methods typically fuse at a single level or stage, either the feature level or the decision level, and therefore cannot map the true semantic information in the data to the classifier. This paper proposes a multilevel multimodal fusion method for human action recognition that is better suited to practical application scenarios. First, at the input end, depth data are converted into Depth Motion Maps (DMMs) and inertial data are converted into signal images. Each input modality is then expanded into multiple modalities by processing the DMMs and the signal images with Local Ternary Patterns (LTP). Next, a convolutional neural network is trained on each modality to extract features, and the extracted features are fused at the feature level via Discriminant Correlation Analysis (DCA), which maximizes the correlation between corresponding features in the two feature sets while eliminating the correlation between features of different classes within each set. Finally, the fused features are fed to a multiclass support vector machine for action recognition. Experiments on two multimodal datasets, UTD-MHAD and UTD Kinect V2 MHAD, show that the proposed multilevel multimodal fusion framework achieves recognition accuracies of 99.8% and 99.9%, respectively.

Keywords: human action recognition, Depth Motion Maps (DMM), inertial sensor, Local Ternary Patterns (LTP), Discriminant Correlation Analysis (DCA)
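
The two input-side steps named in the abstract, building a depth motion map and coding an image with local ternary patterns, can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a front-view-only DMM formed by accumulating absolute differences between consecutive depth frames, and an 8-neighbour LTP split into "upper" and "lower" binary codes. The function names, the neighbour ordering, and the threshold value are hypothetical choices made for illustration. The signal-image construction, CNN feature extraction, DCA fusion, and SVM classification are not covered.

```python
import numpy as np


def depth_motion_map(frames):
    """Front-view DMM (assumed simplification): accumulate the absolute
    differences between consecutive depth frames of shape (T, H, W)."""
    frames = np.asarray(frames, dtype=np.float64)
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)


def local_ternary_pattern(image, threshold=5.0):
    """8-neighbour LTP, returned as two binary-coded maps.

    A neighbour p of centre c is coded +1 if p >= c + threshold,
    -1 if p <= c - threshold, and 0 otherwise. The +1 positions form
    the 'upper' code and the -1 positions form the 'lower' code.
    Border pixels are skipped, so the output is (H - 2, W - 2).
    """
    image = np.asarray(image, dtype=np.float64)
    h, w = image.shape
    center = image[1:-1, 1:-1]
    # Neighbour offsets (dy, dx), clockwise from the top-left corner
    # (the ordering is an arbitrary choice for this sketch).
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    upper = np.zeros(center.shape, dtype=np.uint8)
    lower = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dy, dx) in enumerate(offsets):
        neighbor = image[1 + dy:h - 1 + dy, 1 + dx:w - 1 + dx]
        upper |= (neighbor >= center + threshold).astype(np.uint8) << bit
        lower |= (neighbor <= center - threshold).astype(np.uint8) << bit
    return upper, lower


# Usage on a toy depth sequence of three 4x4 frames.
toy_frames = np.zeros((3, 4, 4))
toy_frames[1, 1, 1] = 10.0
toy_frames[2, 1, 1] = 4.0
toy_dmm = depth_motion_map(toy_frames)
toy_upper, toy_lower = local_ternary_pattern(toy_dmm, threshold=5.0)
```

In this sketch the DMM and its two LTP code maps would each act as a separate input modality, which is one way to read the abstract's statement that LTP turns each input into multiple modalities.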

CLC Number: TP391

MA Yatong, WANG Song, LIU Yingfang. Research on Human Action Recognition Method by Fusing Multimodal Data[J]. Computer Engineering, 2022, 48(9): 180-188.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0064490

http://www.ecice06.com/EN/Y2022/V48/I9/180
