
Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 280-288, 297. doi: 10.19678/j.issn.1000-3428.0065152

• Development Research and Engineering Application •

Action Recognition Method with Multi-Modality Fusion Based on Knowledge Distillation

Jianhao ZHAN1, Lipeng GAN1, Yonghui BI2, Peng ZENG3, Xiaochao LI1,*

  1. School of Electronic Science and Engineering, Xiamen University, Xiamen 361005, Fujian, China
    2. Xiamen Meiya Pico Information Co., Ltd., Xiamen 361016, Fujian, China
    3. Xiamen Public Security Bureau, Xiamen 361104, Fujian, China
  • Received: 2022-07-05 Online: 2023-10-15 Published: 2023-10-10
  • Contact: Xiaochao LI
  • About the authors:

    ZHAN Jianhao (born 1997), male, M.S. candidate; his main research interests are deep learning and action recognition

    GAN Lipeng, M.S. candidate

    BI Yonghui, B.S.

    ZENG Peng, intermediate-rank police technologist, B.S.

  • Funding:
    Fujian Provincial University-Industry-Research Joint Innovation Project (2022H6004); Fund of the Fujian Provincial Key Laboratory of Universities for Integrated Circuit Design and Test Analysis; Xiamen University Malaysia Research Fund (XMUMRF/2019-C4/IECE/0008)



Abstract:

Multi-modality fusion is a core technique for effectively exploiting the complementary features of multiple modalities to improve action recognition performance, with fusion performed at the data, feature, and decision levels. This study investigates multi-modality fusion at the feature and decision levels through multi-teacher knowledge distillation, transferring complementary features from other modalities to the RGB network, and examines the effects of different distillation loss functions and modality combinations. A multi-modality distillation fusion method for action recognition is proposed: knowledge distillation uses the MSE loss function at the feature level and KL divergence at the prediction level, with the original skeleton and optical flow modalities combined as multi-teacher networks, so that the RGB student network simultaneously learns the feature-level semantic information and prediction distributions of the optical flow and skeleton teacher networks, thereby improving recognition accuracy. Extensive experiments show that the proposed method achieves state-of-the-art performance, with accuracies of 90.09%, 95.12%, 97.82%, and 81.26% on the multi-modality datasets NTU RGB+D 60, UTD-MHAD, and N-UCLA and the single-modality dataset HMDB51, respectively. On the UTD-MHAD dataset, recognition accuracy improves by 3.49, 2.54, 3.21, and 7.34 percentage points, respectively, compared with single-modality RGB data.
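The combined objective described in the abstract (feature-level MSE plus prediction-level KL divergence, averaged over the skeleton and optical flow teachers) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names, the weighting factor alpha, and the temperature T are assumptions for exposition.

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-softened softmax of a logit vector."""
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

def kl_div(p, q):
    """KL(p || q) between two discrete distributions (teacher p, student q)."""
    return float(np.sum(p * np.log(p / q)))

def distill_loss(student_feat, student_logits, teachers, T=4.0, alpha=0.5):
    """Multi-teacher distillation loss: MSE on features plus KL divergence
    on softened predictions, each averaged over the teacher networks.
    `teachers` is a list of (teacher_feat, teacher_logits) pairs,
    e.g. one pair from the skeleton teacher and one from the optical flow teacher."""
    feat_loss = np.mean([np.mean((student_feat - tf) ** 2)
                         for tf, _ in teachers])
    q = softmax(student_logits, T)
    pred_loss = np.mean([kl_div(softmax(tl, T), q) for _, tl in teachers])
    # T**2 rescaling keeps the soft-target gradient magnitude comparable
    # across temperatures (standard practice in distillation).
    return alpha * feat_loss + (1 - alpha) * (T ** 2) * pred_loss
```

In training, this loss would be added to the usual cross-entropy on ground-truth labels; here only the distillation terms are shown.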

Key words: action recognition, knowledge distillation, multi-modality fusion, deep learning, multi-teacher network