Residual Behavior Recognition Model Based on Spatio-Temporal Shuffle Attention Mechanism

doi:10.19678/j.issn.1000-3428.0068936

Abstract:

This paper presents a residual behavior recognition model based on a Spatio-Temporal Shuffle Attention (SAT) mechanism, with the aim of improving the effectiveness of spatio-temporal feature extraction by 3D convolution in deep learning models. The SAT mechanism is a lightweight, multidimensional hybrid attention mechanism composed of a channel-temporal attention submodule and a spatial attention submodule. The channel-temporal submodule adds the temporal dimension to channel attention in order to capture both temporal and channel information. The spatial attention submodule compresses redundant temporal information to strengthen the focus on spatial features. The extracted features then undergo channel shuffling and regrouping, which improves the model's data representation ability and reduces the parameter count. The model uses a ResNeXt residual network to extract spatio-temporal features and embeds the SAT module in the residual module, so that the attention module learns the weights of different feature maps and weights the extracted features in the channel, temporal, and spatial domains, enhancing the network's ability to represent human behavior. Focal Loss, an improved cross-entropy function, is used as the loss function to address the uneven sample distribution that may exist in the datasets. Experimental results show that the model achieves recognition accuracies of 96.3% and 71.6% on the UCF101 and HMDB51 datasets, respectively, a significant improvement over the other models compared.

Key words: deep learning, behavior recognition, Spatio-temporal Shuffle Attention (SAT), residual network, cross-entropy function
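
As a rough illustration of two components named in the abstract, the sketch below shows channel shuffling and a per-sample Focal Loss in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names `channel_shuffle` and `focal_loss` are hypothetical, the shuffle follows the ShuffleNet-style interleaving (reshape the channel axis to groups, transpose, flatten), and the values `gamma=2.0` and `alpha=1.0` are illustrative defaults, since the abstract does not state the hyperparameters used. For a video feature map of shape (C, T, H, W), the shuffle would act only on the channel axis, so a list of per-channel items stands in for the tensor here.

```python
import math


def channel_shuffle(channels, groups):
    """Interleave channels across groups (ShuffleNet-style shuffle).

    Equivalent to reshaping the channel axis to (groups, n // groups),
    transposing to (n // groups, groups), and flattening again.
    `channels` is any list with one item per channel (assumed layout).
    """
    n = len(channels)
    if n % groups != 0:
        raise ValueError("number of channels must be divisible by groups")
    per_group = n // groups
    return [channels[g * per_group + i]
            for i in range(per_group)
            for g in range(groups)]


def focal_loss(probs, target, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p_t)**gamma * log(p_t).

    probs  : predicted class probabilities for the sample
    target : index of the true class
    gamma, alpha : illustrative values, not taken from the paper.
    With gamma=0 and alpha=1 this reduces to plain cross-entropy.
    """
    p_t = probs[target]
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Six channels in two groups: [0, 1, 2 | 3, 4, 5]
    print(channel_shuffle([0, 1, 2, 3, 4, 5], groups=2))

    easy = [0.9, 0.05, 0.05]  # confident, correct prediction for class 0
    hard = [0.2, 0.5, 0.3]    # true class 0 receives low probability
    print(focal_loss(easy, 0), focal_loss(hard, 0))
```

The factor `(1 - p_t)**gamma` shrinks the loss of well-classified samples much more than that of hard ones, which is how Focal Loss shifts training emphasis toward difficult or under-represented classes.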

JIANG Jieping, WANG Mingwen. Residual Behavior Recognition Model Based on Spatio-Temporal Shuffle Attention Mechanism[J]. Computer Engineering, 2025, 51(4): 119-128.

https://www.ecice06.com/CN/Y2025/V51/I4/119

Figures/Tables (16)

Fig.1 Comparison of network structures

Fig.2 ResNeXt residual module structure

Fig.3 Structure of shuffle attention mechanism

Fig.4 Channel-temporal attention submodule

Fig.5 Spatial attention submodule

Fig.6 Structure of SAT-Bottleneck

Fig.7 Structure of SAT-Resnext

Fig.8 Experiment procedure

Fig.9 UCF101 training process

Fig.10 HMDB51 training process

Fig.11 Overall confusion matrix

Fig.12 Partial confusion matrix
