Review of Deep Learning Methods for Short-Term Action Anticipation

Computer Engineering, 2026, Vol. 52, Issue (6): 31-52. doi: 10.19678/j.issn.1000-3428.0252357

SUN Haifeng¹, YAO Junping¹, LI Xiaojun¹*, LIU Yanfei², GU Hongyang¹

1. School of Operational Support, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China
2. Department of Basic, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China

Received: 2025-04-22 Revised: 2025-07-20 Online: 2026-06-15 Published: 2025-08-21
Corresponding author: LI Xiaojun
About the authors:
SUN Haifeng, male, Ph.D. candidate; main research interests: action recognition and human-object interaction detection
YAO Junping, professor, Ph.D.
LI Xiaojun (corresponding author), associate professor, Ph.D.
LIU Yanfei, professor, Ph.D.
GU Hongyang, lecturer, Ph.D.
Funding:
National Natural Science Foundation of China (62401609); China Postdoctoral Science Foundation (2024M754275); Natural Science Basic Research Program of Shaanxi Province (2025JC-YBMS-783)

Abstract

Short-term action anticipation, a crucial task in video understanding, transforms observed physical motions into inferences about action intentions and goals by modeling the spatiotemporal and semantic features of historical actions. It aims to accurately predict interactive behaviors within the next few seconds and has broad application prospects in human-machine collaboration, security surveillance, autonomous driving, and augmented reality. In recent years, innovations in feature extraction models and the construction of high-quality datasets have jointly propelled the development of video understanding and shifted short-term action anticipation from a knowledge-driven machine learning paradigm to a data-driven deep learning paradigm. This survey systematically reviews the latest deep learning methods for short-term action anticipation, with the aim of providing a reference for related research and for the analysis of practical applications. First, a classification framework is constructed from three perspectives: model architecture innovation, training strategy application, and contextual modeling methods. Within this framework, key technologies and challenges in the field are analyzed, and the characteristics, applicable scenarios, and research progress of each method category are elaborated. Next, datasets commonly used for this task are briefly summarized, and the performance of various methods on mainstream datasets is compared. Finally, current challenges are identified and future research directions are outlined, including multi-view collaborative prediction, real-time model inference verification, weakly supervised learning from untrimmed data, few-shot class-incremental generalization, dynamic open-scene adaptation, and variable time interval prediction.

Key words: video understanding, short-term action anticipation, semantic action, deep learning, training strategy

SUN Haifeng, YAO Junping, LI Xiaojun, LIU Yanfei, GU Hongyang. Review of Deep Learning Methods for Short-Term Action Anticipation[J]. Computer Engineering, 2026, 52(6): 31-52.

https://www.ecice06.com/CN/Y2026/V52/I6/31

Figures/Tables (16)

Fig.1 Research framework of this survey

Fig.2 S-GEAR model with typical leader-follower architecture

Fig.3 UADT model with typical parallel architecture

Fig.4 SF-GRU model with typical iterative composite architecture

Fig.5 Schematic diagram of future feature prediction as an intermediate task for short-term action anticipation

Fig.6 GTF model architecture

Fig.7 Schematic diagram in which the objective of the entire sequence is defined as the final visual representation at the end of the action sequence

