
Computer Engineering ›› 2023, Vol. 49 ›› Issue (10): 280-288, 297. doi: 10.19678/j.issn.1000-3428.0065152

• Development Research and Engineering Application •

Action Recognition Method with Multi-Modality Fusion Based on Knowledge Distillation

Jianhao ZHAN1, Lipeng GAN1, Yonghui BI2, Peng ZENG3, Xiaochao LI1,*

  1. School of Electronic Science and Engineering, Xiamen University, Xiamen 361005, Fujian, China
    2. Xiamen Meiya Pico Information Co., Ltd., Xiamen 361016, Fujian, China
    3. Xiamen Public Security Bureau, Xiamen 361104, Fujian, China
  • Received: 2022-07-05 Online: 2023-10-15 Published: 2023-10-10
  • Contact: Xiaochao LI
  • About the authors:

    ZHAN Jianhao (born 1997), male, M.S. candidate; his main research interests are deep learning and action recognition

    GAN Lipeng, M.S. candidate

    BI Yonghui, B.S.

    ZENG Peng, intermediate-rank police technologist, B.S.

  • Funding:
    Fujian Provincial University-Industry-Research Joint Innovation Project (2022H6004); Fund of the Fujian Provincial Key Laboratory of Universities for Integrated Circuit Design and Test Analysis; Xiamen University Malaysia Research Fund (XMUMRF/2019-C4/IECE/0008)



Abstract:

Multi-modality fusion is a core technique for effectively exploiting the complementary features of multiple modalities to improve action recognition performance, with fusion performed at the data, feature, and decision levels. This study investigates multi-modality fusion at the feature and decision levels through multi-teacher knowledge distillation, transferring complementary features from other modalities to the RGB network, and examines the effects of different distillation loss functions and modality combinations. A multi-modality distillation fusion method for action recognition is proposed: knowledge distillation uses the MSE loss function at the feature level and KL divergence at the prediction level, with the original skeleton and optical flow modalities combined as multi-teacher networks, so that the RGB student network simultaneously learns the feature-level semantic information and prediction distributions of the optical flow and skeleton teacher networks, thereby improving recognition accuracy. Extensive experiments show that the proposed method achieves state-of-the-art performance, with accuracies of 90.09%, 95.12%, 97.82%, and 81.26% on the multi-modality datasets NTU RGB+D 60, UTD-MHAD, and N-UCLA and the single-modality dataset HMDB51, respectively. On the UTD-MHAD dataset, recognition accuracy improves by 3.49, 2.54, 3.21, and 7.34 percentage points, respectively, compared with single-modality RGB data.
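The combined objective described in the abstract (feature-level MSE plus prediction-level KL divergence, averaged over the skeleton and optical flow teachers) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the function names, the weighting factor alpha, and the temperature T are assumptions for exposition.

```python
import numpy as np

def softmax(x, T=1.0):
    """Temperature-softened softmax of a logit vector."""
    z = np.exp((x - x.max()) / T)
    return z / z.sum()

def kl_div(p, q):
    """KL(p || q) between two discrete distributions (teacher p, student q)."""
    return float(np.sum(p * np.log(p / q)))

def distill_loss(student_feat, student_logits, teachers, T=4.0, alpha=0.5):
    """Multi-teacher distillation loss: MSE on features plus KL divergence
    on softened predictions, each averaged over the teacher networks.
    `teachers` is a list of (teacher_feat, teacher_logits) pairs,
    e.g. one pair from the skeleton teacher and one from the optical flow teacher."""
    feat_loss = np.mean([np.mean((student_feat - tf) ** 2)
                         for tf, _ in teachers])
    q = softmax(student_logits, T)
    pred_loss = np.mean([kl_div(softmax(tl, T), q) for _, tl in teachers])
    # T**2 rescaling keeps the soft-target gradient magnitude comparable
    # across temperatures (standard practice in distillation).
    return alpha * feat_loss + (1 - alpha) * (T ** 2) * pred_loss
```

In training, this loss would be added to the usual cross-entropy on ground-truth labels; here only the distillation terms are shown.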

Key words: action recognition, knowledge distillation, multi-modality fusion, deep learning, multi-teacher network