Review of Deep Learning Methods for Short-Term Action Anticipation

doi:10.19678/j.issn.1000-3428.0252357

Abstract

Short-term action anticipation, an important task in video understanding, models the spatiotemporal and semantic features of past actions to turn observed physical motions into inferences about intentions and goals, and thereby predicts the interactions that will occur within the next few seconds. It has broad application prospects in human-machine collaboration, security surveillance, autonomous driving, and augmented reality. In recent years, driven by breakthroughs in deep learning, particularly in feature extraction models and high-quality datasets for video understanding, short-term action anticipation has shifted from a knowledge-driven machine learning paradigm to a data-driven deep learning paradigm. This survey systematically reviews recent deep learning methods for the task, with the aim of providing a reference for related research and for the analysis of practical applications. It first builds a taxonomy along three dimensions: model architecture innovation, training strategy, and context modeling. It analyzes the key techniques and challenges in the field and describes the characteristics, applicable scenarios, and research progress of each category of methods. It then briefly summarizes the datasets commonly used for the task and compares the performance of multiple methods on mainstream datasets. Finally, it identifies the challenges the field currently faces and discusses possible future research directions, including multi-view collaborative prediction, real-time model inference validation, weakly supervised learning from untrimmed data, few-shot class-incremental generalization, adaptation to dynamic open scenes, and anticipation over variable time intervals.

Sun Haifeng, Yao Junping, Li Xiaojun, Liu Yanfei, Gu Hongyang. Review of Deep Learning Methods for Short-Term Action Anticipation[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252357.

