
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 180-188. doi: 10.19678/j.issn.1000-3428.0064490

• Graphics and Image Processing •

Research on Human Action Recognition Method by Fusing Multimodal Data

MA Yatong1, WANG Song1,2, LIU Yingfang1

  1. School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China;
    2. Gansu Provincial Engineering Research Center for Artificial Intelligence and Graphic and Imaging Processing, Lanzhou 730070, China
  • Received: 2022-04-18  Revised: 2022-05-29  Published: 2022-06-10
  • About the authors: MA Yatong (born 1997), male, master's student; research interests: computer vision and human action recognition. WANG Song (corresponding author), associate professor, Ph.D. LIU Yingfang, master's student.
  • Funding:
    National Natural Science Foundation of China (62067006); Natural Science Foundation of Gansu Province (21JR7RA291); Gansu Provincial Education Science and Technology Innovation Project (2021jyjbgs-05); Gansu Provincial University Industry Support Program (2020C-19).




Abstract: Human action recognition technology based on multimodal fusion has been widely investigated. In such methods, feature-level or decision-level fusion is performed at a single level or stage, so the true semantic information cannot be mapped from the data to the classifier. Hence, this paper proposes a multilevel multimodal fusion method for human action recognition that is better suited to practical application scenarios. First, at the input end, depth data are converted into Depth Motion Maps (DMM) and inertial data into signal images. Subsequently, each input modality is further rendered multimodal by processing the depth motion maps and signal images with Local Ternary Patterns (LTP). Next, all modalities are fed to a convolutional neural network for feature extraction, and the extracted features are fused at the feature level via Discriminant Correlation Analysis (DCA), which maximizes the correlation of corresponding features across the two feature sets while eliminating correlations between different classes within each feature set. Finally, the fused features are used as input to a multiclass support vector machine for human action recognition. Experiments are conducted on two multimodal datasets, UTD-MHAD and UTD Kinect V2 MHAD. The results show that the proposed multilevel multimodal fusion framework achieves recognition accuracies of 99.8% and 99.9% on these two datasets, respectively.
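The depth-to-DMM conversion described above is commonly formulated as accumulating absolute differences between consecutive depth frames. The following is a minimal sketch of that front-view accumulation under the standard DMM formulation, not code from the paper; side and top views are obtained the same way after projecting each frame onto the Y-Z and X-Z planes, and the function name is illustrative:

```python
import numpy as np

def dmm_front(depth_frames):
    """Front-view Depth Motion Map: accumulate the absolute difference
    between every pair of consecutive depth frames, yielding a single
    2-D motion-energy map for the whole sequence."""
    frames = np.asarray(depth_frames, dtype=np.float32)  # shape (T, H, W)
    # |D_{t+1} - D_t| summed over time -> (H, W) motion-energy map
    return np.abs(np.diff(frames, axis=0)).sum(axis=0)
```

A sequence in which nothing moves yields an all-zero map, while regions swept by the moving body accumulate large values, which is what makes the DMM a compact action descriptor.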

Key words: human action recognition, Depth Motion Maps (DMM), inertial sensor, Local Ternary Patterns (LTP), Discriminant Correlation Analysis (DCA)
