Action Detection Method Based on Salient Target Tracking

doi:10.19678/j.issn.1000-3428.0069350

Abstract

Action detection comprises action classification and boundary localization, and existing work focuses mainly on action features and boundary features. Current methods tend to overlook the importance of spatial features for this task and produce ambiguous action boundary predictions, which limits the performance and applicability of action detection models. To address these problems, this paper proposes a Salient Object Tracking-based Action Detection (SOT-AD) method. First, to learn salient spatial information at different scales, a hierarchical attention network is introduced to capture the salient objects associated with an action while reducing interference from action-irrelevant information. Second, to keep the salient objects attended to at adjacent temporal positions consistent, a salient object tracking loss is proposed. Finally, neutral samples are introduced to help construct a "target / sub-target / background" feature pool, which learns temporal context from the feature sequence and enables salient object tracking. Experimental results on two widely used datasets, THUMOS14 and ActivityNet1.3, show that SOT-AD improves mean Average Precision (mAP) over mainstream methods by an average of 0.9 and 0.6 percentage points, respectively. On THUMOS14, SOT-AD achieves an mAP@0.5 of 72.7%.

Keywords: action detection (AD), attention mechanism, noise-contrastive loss, action tracking, feature pyramid
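
The mAP@0.5 figure reported in the abstract relies on the usual overlap criterion for temporal action detection: a predicted segment is matched to a ground-truth segment only if their temporal Intersection over Union (tIoU) reaches the threshold, here 0.5. The following is a minimal sketch of that overlap computation only; the function name `temporal_iou` and the example segments are illustrative assumptions and are not taken from the paper.

```python
def temporal_iou(pred, gt):
    """Temporal IoU between two (start, end) segments given in seconds."""
    p_start, p_end = pred
    g_start, g_end = gt
    # Length of the overlap; zero when the segments do not intersect.
    intersection = max(0.0, min(p_end, g_end) - max(p_start, g_start))
    # Union = sum of both lengths minus the shared part.
    union = (p_end - p_start) + (g_end - g_start) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical prediction and ground-truth segments (assumed values).
pred = (2.0, 8.0)
gt = (4.0, 10.0)
print(temporal_iou(pred, gt))  # overlap 4 s, union 8 s -> 0.5
```

Under a 0.5 threshold this hypothetical prediction would just qualify as a match; shifting either boundary further from the ground truth would drop it below the threshold, which is why ambiguous boundary prediction directly lowers mAP.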

SHAN Pengchang, GAO Lijian, DONG Wenlong, MAO Qirong. Action Detection Method Based on Salient Target Tracking[J]. Computer Engineering, 2025, 51(6): 93-101.

单鹏畅, 高利剑, 董文龙, 毛启容. 基于显著目标追踪的行为检测方法[J]. 计算机工程, 2025, 51(6): 93-101.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069350

https://www.ecice06.com/EN/Y2025/V51/I6/93

Figures/Tables 9

Fig.1 Structure of action detection model based on salient target tracking

Fig.2 Partial video frames

Fig.3 Multi-scale hierarchical attention diagram

Fig.4 Target feature pool, sub-target feature pool, and background feature pool

Fig.5 The calculation results of attention scores

Fig.6 Ablation experiment results of hyperparameters τ₀ and τ₁
