Action Detection Method Based on Salient Target Tracking

doi:10.19678/j.issn.1000-3428.0069350

Abstract:

Action detection comprises action classification and boundary localization, and therefore depends on both action features and boundary features. Existing methods often overlook the importance of spatial action features for this task and predict action boundaries ambiguously, which degrades the performance and practical usefulness of action detection models. To address these problems, this paper proposes a Salient Object Tracking-based Action Detection (SOT-AD) method. First, to learn salient spatial information at different scales, a hierarchical attention network is proposed to capture the salient objects associated with actions and to reduce interference from action-irrelevant information. Second, to keep the salient objects attended to at adjacent temporal positions consistent, a salient object tracking loss is proposed. Finally, neutral samples are introduced to help construct a "target-sub-target-background" feature pool, which is used to learn the temporal context of the features and thereby realize salient object tracking. Experimental results on two widely used datasets, THUMOS14 and ActivityNet1.3, show that, compared with mainstream methods, SOT-AD improves mean Average Precision (mAP) by an average of 0.9 and 0.6 percentage points, respectively. On the THUMOS14 dataset, SOT-AD achieves an mAP@0.5 of 72.7%.

Key words: action detection (AD), attention mechanism, noise-contrastive loss, action tracking, feature pyramid
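The mAP@0.5 figure quoted in the abstract counts a predicted segment as correct only when its temporal Intersection over Union (tIoU) with a ground-truth segment of the same class is at least 0.5. The sketch below illustrates that idea for a single class in a single video. It is a simplified illustration, not the evaluation code used in the paper: the function names `temporal_iou` and `average_precision`, the greedy matching rule, and the non-interpolated AP formula are all assumptions made here for clarity.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(preds, gts, tiou_thr=0.5):
    """Simplified AP for one class.

    preds: list of (start, end, score); gts: list of (start, end).
    Assumption: each ground-truth segment may be matched at most once,
    and predictions are processed in descending score order.
    """
    preds = sorted(preds, key=lambda p: p[2], reverse=True)
    matched = set()
    hits = []
    for start, end, _score in preds:
        # Find the best still-unmatched ground-truth segment.
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(gts):
            if j in matched:
                continue
            iou = temporal_iou((start, end), gt)
            if iou > best_iou:
                best_iou, best_j = iou, j
        if best_iou >= tiou_thr:
            matched.add(best_j)
            hits.append(1)
        else:
            hits.append(0)
    # Non-interpolated AP: average the precision at each true-positive rank.
    ap, tp = 0.0, 0
    for rank, hit in enumerate(hits, start=1):
        if hit:
            tp += 1
            ap += tp / rank
    return ap / len(gts) if gts else 0.0


if __name__ == "__main__":
    gts = [(0.0, 10.0), (20.0, 30.0)]
    preds = [(1.0, 9.0, 0.9), (40.0, 50.0, 0.8), (21.0, 31.0, 0.7)]
    print(average_precision(preds, gts, tiou_thr=0.5))
```

Averaging this per-class AP over all action classes gives mAP at the chosen tIoU threshold. Raising `tiou_thr` makes the metric stricter about boundary accuracy, which is why boundary ambiguity directly lowers mAP.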

SHAN Pengchang, GAO Lijian, DONG Wenlong, MAO Qirong. Action Detection Method Based on Salient Target Tracking[J]. Computer Engineering, 2025, 51(6): 93-101.

https://www.ecice06.com/CN/Y2025/V51/I6/93

Figures/Tables: 9

Fig.1 Structure of the action detection model based on salient target tracking

Fig.2 Sample video frames

Fig.3 Multi-scale hierarchical attention diagram

Fig.4 Target feature pool, sub-target feature pool, and background feature pool

Fig.5 Computed attention scores

Fig.6 Ablation results for the hyperparameters τ₀ and τ₁

