[1] VALLS MASCARO E, SLIWOWSKI D, LEE D. HOI4ABOT: Human-Object Interaction Anticipation for Human Intention Reading Collaborative roBOTs[J/OL]. https://doi.org/10.48550/arXiv.2309.16524, 2023.
[2] BENMESSABIH T, SLAMA R, HAVARD V, et al. Online human motion analysis in industrial context: A review[J]. Engineering Applications of Artificial Intelligence, 2024, 131: 107850.
[3] DRAGAN A D, SRINIVASA S S. Formalizing assistive teleoperation[J]. Robotics: Science and Systems, 2012: 73-80.
[4] WANG Z, DEISENROTH M P, AMOR H B, et al. Probabilistic modeling of human movements for intention inference[J]. Robotics: Science and Systems, 2012, 8: 433-440.
[5] KOPPULA H S, SAXENA A. Anticipating human activities using object affordances for reactive robotic response[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2016, 38(1): 14-29.
[6] DANG L M, MIN K, WANG H, et al. Sensor-based and vision-based human activity recognition: A comprehensive survey[J]. Pattern Recognition, 2020, 108: 107561.
[7] ZIAEEFARD M, BERGEVIN R. Semantic human activity recognition: A literature review[J]. Pattern Recognition, 2015, 48(8): 2329-2345.
[8] GRAUMAN K, WESTBURY A, BYRNE E, et al. Ego4d: Around the world in 3,000 hours of egocentric video [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2022: 18995-19012.
[9] PASCA R G, GAVRYUSHIN A, HAMZA M, et al. Summarize the past to predict the future: Natural language descriptions of context boost multimodal object interaction anticipation [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2024: 18286-18296.
[10] MUR-LABADIA L, MARTINEZ-CANTIN R, GUERRERO J J, et al. AFF-ttention! Affordances and attention models for short-term object interaction anticipation [C]// European Conference on Computer Vision. Berlin: Springer, 2024: 167-184.
[11] FURNARI A, BATTIATO S, MARIA FARINELLA G. Leveraging uncertainty to rethink loss functions and evaluation measures for egocentric action anticipation [C]// European Conference on Computer Vision. Berlin: Springer, 2018: 389-405.
[12] PEI B, CHEN G, XU J, et al. EgoVideo: Exploring egocentric foundation model and downstream adaptation[J/OL]. https://doi.org/10.48550/arXiv.2406.18070, 2024.
[13] RAJASEGARAN J, RADOSAVOVIC I, RAVISHANKAR R, et al. An empirical study of autoregressive pre-training from videos[J/OL]. https://doi.org/10.48550/arXiv.2501.05453, 2025.
[14] CHO H, KANG D U, CHUN S Y. Short-term object interaction anticipation with disentangled object detection @ Ego4D short term object interaction anticipation challenge[J/OL]. https://doi.org/10.48550/arXiv.2407.05713, 2024.
[15] CHEN G, XING S, CHEN Z, et al. Internvideo-ego4d: A pack of champion solutions to ego4d challenges[J/OL]. https://doi.org/10.48550/arXiv.2211.09529, 2022.
[16] RAGUSA F, FARINELLA G M, FURNARI A. Stillfast: An end-to-end approach for short-term object interaction anticipation [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2023: 3636-3645.
[17] THAKUR S, BEYAN C, MORERIO P, et al. Guided attention for next active object @ Ego4D STA challenge[J/OL]. https://doi.org/10.48550/arXiv.2305.16066, 2023.
[18] KIM S, HUANG D, XIAN Y, et al. Palm: Predicting actions through language models [C]// European Conference on Computer Vision. Berlin: Springer, 2024: 140-158.
[19] LAI B, TOYER S, NAGARAJAN T, et al. Human action anticipation: A survey[J/OL]. https://doi.org/10.48550/arXiv.2410.14045, 2024.
[20] TRAN V, WANG Y, ZHANG Z, et al. Knowledge distillation for human action anticipation [C]// International Conference on Image Processing. New York: IEEE, 2021: 2518-2522.
[21] MANOUSAKI V, PAPOUTSAKIS K, ARGYROS A. Graphing the future: Activity and next active object prediction using graph-based activity representations[J]. Advances in Visual Computing, 2022, 13598: 299-312.
[22] RASOULI A, KOTSERUBA I, TSOTSOS J K. Pedestrian action anticipation using contextual feature fusion in stacked RNNs[J/OL]. https://doi.org/10.48550/arXiv.2005.06582, 2020.
[23] OSMAN N, CAMPORESE G, COSCIA P, et al. Slowfast rolling-unrolling LSTMs for action anticipation in egocentric videos [C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2021: 3437-3445.
[24] GIRDHAR R, GRAUMAN K. Anticipative video transformer [C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2021: 13505-13515.
[25] GU X, QIU J, GUO Y, et al. TransAction: ICL-SJTU submission to epic-kitchens action anticipation challenge 2021[J/OL]. https://doi.org/10.48550/arXiv.2107.13259, 2021.
[26] MIECH A, LAPTEV I, SIVIC J, et al. Leveraging the present to anticipate the future in videos [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2019: 2915-2922.
[27] ZHANG T, MIN W, ZHU Y, et al. An egocentric action anticipation framework via fusing intuition and analysis [C]// Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 402-410.
[28] DESSALENE E, DEVARAJ C, MAYNORD M, et al. Forecasting action through contact representations from first person video[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2023, 45(6): 6703-6714.
[29] TAI T M, FIAMENI G, LEE C K, et al. Unified recurrence modeling for video action anticipation [C]// International Conference on Pattern Recognition. New York: IEEE, 2022: 3273-3279.
[30] 张天予, 闵巍庆, 韩鑫阳,等. 视频中的未来动作预测研究综述[J]. 计算机学报, 2023, 46(6): 1315-1338. (ZHANG Tianyu, MIN Weiqing, HAN Xinyang, et al. A Survey on Future Action Anticipation in Videos[J]. Chinese Journal of Computers, 2023, 46(6): 1315-1338)
[31] NI Z, MASCARO E V, AHN H, et al. Human-object interaction prediction in videos through gaze following[J]. Computer Vision and Image Understanding, 2023, 233: 103741.
[32] LIU M, TANG S, LI Y, et al. Forecasting human-object interaction: Joint prediction of motor attention and actions in first person video [C]// European Conference on Computer Vision. Berlin: Springer, 2020: 704-721.
[33] THAKUR S, BEYAN C, MORERIO P, et al. Enhancing next active object-based egocentric action anticipation with guided attention [C]// International Conference on Image Processing. New York: IEEE, 2023: 1450-1454.
[34] WANG X, ZHANG S, QING Z, et al. OADTR: Online action detection with transformers[C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2021: 7565-7575.
[35] GIRASE H, AGARWAL N, CHOI C, et al. Latency matters: Real-time action forecasting transformer[C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2023: 18759-18769.
[36] GUERMAL M, ALI A, DAI R, et al. JOADAA: Joint online action detection and action anticipation[C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway, NJ: IEEE Press, 2024: 6889-6898.
[37] CHEN J, LI X, CAO J, et al. RHINO: Learning real-time humanoid-human-object interaction from human demonstrations[J/OL]. https://doi.org/10.48550/arXiv.2502.13134, 2025.
[38] FERNANDO B, HERATH S. Anticipating human actions by correlating past with the future with jaccard similarity measures [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2021: 13224-13233.
[39] ROY D, FERNANDO B. Action anticipation using latent goal learning [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway, NJ: IEEE Press, 2022: 2745-2753.
[40] XU X, LI Y L, LU C. Learning to anticipate future with dynamic context removal [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2022: 12734-12744.
[41] 莫凌飞, 蒋红亮, 李煊鹏. 基于深度学习的视频预测研究综述[J]. 智能系统学报, 2018, 13(1): 85-96. (MO Lingfei, JIANG Hongliang, LI Xuanpeng. Review of deep learning-based video prediction[J]. CAAI Transactions on Intelligent Systems, 2018, 13(1): 85-96.)
[42] LIU T, LAM K M. A hybrid egocentric activity anticipation framework via memory-augmented recurrent and one-shot representation forecasting [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2022: 13904-13913.
[43] WU C Y, LI Y, MANGALAM K, et al. MeMViT: Memory-augmented multiscale vision transformer for efficient long-term video recognition [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2022: 13587-13597.
[44] DIKO A, AVOLA D, PRENKAJ B, et al. Semantically guided representation learning for action anticipation [C]// European Conference on Computer Vision. Berlin: Springer, 2024: 448-466.
[45] CAO C, SUN Z, LYU Q, et al. VS-TransGRU: A novel transformer-GRU-based framework enhanced by visual-semantic fusion for egocentric action anticipation[J]. IEEE Trans on Circuits and Systems for Video Technology, 2024, 34(11): 11605-11618.
[46] SENER F, SINGHANIA D, YAO A. Temporal aggregate representations for long-range video understanding [C]// European Conference on Computer Vision. Berlin: Springer, 2020: 154-171.
[47] GUO H, AGARWAL N, LO S Y, et al. Uncertainty-aware action decoupling transformer for action anticipation [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2024: 18644-18654.
[48] WANG J, CHEN G, HUANG Y, et al. Memory-and-anticipation transformer for online action understanding[C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2023: 13824-13835.
[49] ROY D, FERNANDO B. Predicting the next action by modeling the abstract goal[C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2025: 162-177.
[50] QI Z, WANG S, ZHANG W, et al. Uncertainty-boosted robust video activity anticipation[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2024, 46(12): 7775-7792.
[51] HAN X, ZHANG Z, DING N, et al. Pre-trained models: Past, present and future[J]. AI Open, 2021, 2: 225-250.
[52] VONDRICK C, PIRSIAVASH H, TORRALBA A. Anticipating visual representations from unlabeled video [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2016: 98-106.
[53] ZHONG Y, ZHENG W S. Unsupervised learning for forecasting action representations [C]// International Conference on Image Processing. New York: IEEE, 2018: 1073-1077.
[54] WU Y, ZHU L, WANG X, et al. Learning to anticipate egocentric actions by imagination[J]. IEEE Trans on Image Processing, 2020, 30: 1143-1152.
[55] GUPTA A, LIU J, BO L, et al. A-ACT: Action anticipation through cycle transformations[J/OL]. https://doi.org/10.48550/arXiv.2204.00942, 2022.
[56] ROTONDO T, FARINELLA G M, TOMASELLI V, et al. Action anticipation from multimodal data [C]// 14th International Conference on Computer Vision Theory and Applications. Setubal: SCITEPRESS, 2019: 154-161.
[57] ZATSARYNNA O, ABU FARHA Y, GALL J. Multi-modal temporal convolutional network for anticipating actions in egocentric videos [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2021: 2249-2258.
[58] SHEN Y, NI B, LI Z, et al. Egocentric activity prediction via event modulated attention [C]// European Conference on Computer Vision. Berlin: Springer, 2018: 197-212.
[59] MANOUSAKI V, BACHARIDIS K, PAPOUTSAKIS K, et al. VLMAH: Visual-linguistic modeling of action history for effective action anticipation [C]// International Conference on Computer Vision Workshops. Piscataway, NJ: IEEE, 2023: 1917-1927.
[60] ZHONG Z, SCHNEIDER D, VOIT M, et al. Anticipative feature fusion transformer for multi-modal action anticipation [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway, NJ: IEEE Press, 2023: 6068-6077.
[61] KIM M H, JUNG J W, LEE E G, et al. Disentangled adaptive fusion transformer using adversarial perturbation for egocentric action anticipation[J]. Expert Systems with Applications, 2025: 127648.
[62] GHOSH S, AGGARWAL T, HOAI M, et al. Text-derived knowledge helps vision: A simple cross-modal distillation for video-based action anticipation[J/OL]. https://doi.org/10.48550/arXiv.2210.05991, 2022.
[63] WANG S, ZHANG C, WANG L, et al. Long and short-term collaborative decision-making transformer for online action detection and anticipation[J]. Pattern Recognition, 2025: 111773.
[64] XU M, XIONG Y, CHEN H, et al. Long short-term transformer for online action detection[J]. Advances in Neural Information Processing Systems, 2021, 34: 1086-1099.
[65] NAGARAJAN T, LI Y, FEICHTENHOFER C, et al. Ego-topo: Environment affordances from egocentric video[C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2020: 163-172.
[66] HUANG Y, YANG X, XU C. Multimodal global relation knowledge distillation for egocentric action anticipation [C]// Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 245-254.
[67] CHANG C Y, HUANG D A, XU D, et al. Procedure planning in instructional videos [C]// European Conference on Computer Vision. Berlin: Springer, 2020: 334-350.
[68] DAMEN D, DOUGHTY H, FARINELLA G M, et al. Scaling egocentric vision: The epic-kitchens dataset [C]// European Conference on Computer Vision. Berlin: Springer, 2018: 720-736.
[69] GIBSON J J. The theory of affordances [M]// GIESEKING J J, MANGOLD W, KATZ C, et al. The people, place, and space reader. London: Routledge, 2014: 56-60.
[70] DO T T, NGUYEN A, REID I. AffordanceNet: An end-to-end deep learning approach for object affordance detection [C]// IEEE International Conference on Robotics and Automation. New York: IEEE, 2018: 5882-5889.
[71] MYERS A, TEO C L, FERMULLER C, et al. Affordance detection of tool parts from geometric features[C]// IEEE International Conference on Robotics and Automation. Los Alamitos: IEEE, 2015: 1374-1381.
[72] NGUYEN A, KANOULAS D, CALDWELL D G, et al. Object-based affordances detection with convolutional neural networks and dense conditional random fields [C]// IEEE/RSJ International Conference on Intelligent Robots and Systems. New York: IEEE, 2017: 5908-5915.
[73] NAGARAJAN T, FEICHTENHOFER C, GRAUMAN K. Grounded human-object interaction hotspots from video[C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2019: 8688-8697.
[74] LUO H, ZHAI W, ZHANG J, et al. Learning visual affordance grounding from demonstration videos[J]. IEEE Trans on Neural Networks and Learning Systems, 2023, 35(11): 16857-16871.
[75] LI G, JAMPANI V, SUN D, et al. Locate: Localize and transfer object parts for weakly supervised affordance grounding [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2023: 10922-10931.
[76] ROY D, RAJENDIRAN R, FERNANDO B. Interaction region visual transformer for egocentric action anticipation [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway, NJ: IEEE Press, 2024: 6740-6750.
[77] LIU S, TRIPATHI S, MAJUMDAR S, et al. Joint hand motion and interaction hotspots prediction from egocentric videos[C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2022: 3282-3292.
[78] JIANG J, NAN Z, CHEN H, et al. Predicting short-term next-active-object through visual attention and hand position[J]. Neurocomputing, 2021, 433: 212-222.
[79] GUAN J, YUAN Y, KITANI K M, et al. Generative hybrid representations for activity forecasting with no-regret learning [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2020: 170-179.
[80] FATHI A, REN X, REHG J M. Learning to recognize objects in egocentric activities [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2011: 3281-3288.
[81] LI Y, YE Z, REHG J M. Delving into egocentric actions [C]// Conference on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE Press, 2015: 287-295.
[82] FATHI A, LI Y, REHG J M. Learning to recognize daily actions using gaze [C]// European Conference on Computer Vision. Berlin: Springer, 2012: 314-327.
[83] LI Y, LIU M, REHG J M. In the eye of beholder: Joint learning of gaze and actions in first person video [C]// European Conference on Computer Vision. Berlin: Springer, 2018: 619-635.
[84] DAMEN D, DOUGHTY H, FARINELLA G M, et al. Scaling egocentric vision: The epic-kitchens dataset [C]// European Conference on Computer Vision. Berlin: Springer, 2018: 720-736.
[85] DAMEN D, DOUGHTY H, FARINELLA G M, et al. Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100[J]. International Journal of Computer Vision, 2022: 1-23.
[86] SONG Y, BYRNE E, NAGARAJAN T, et al. Ego4d goal-step: Toward hierarchical understanding of procedural activities[J]. Advances in Neural Information Processing Systems, 2023, 36: 38863-38886.
[87] FURNARI A, FARINELLA G M. What would you expect? Anticipating egocentric actions with rolling-unrolling LSTMs and modality attention [C]// International Conference on Computer Vision. Piscataway, NJ: IEEE, 2019: 6252-6261.
[88] DESSALENE E, MAYNORD M, DEVARAJ C, et al. Egocentric object manipulation graphs[J/OL]. https://doi.org/10.48550/arXiv.2006.03201, 2020.
[89] ROY D, FERNANDO B. Action anticipation using pairwise human-object interactions and transformers[J]. IEEE Trans on Image Processing, 2021, 30: 8116-8129.
[90] CAMPORESE G, COSCIA P, FURNARI A, et al. Knowledge distillation for action anticipation via label smoothing[C]// International Conference on Pattern Recognition. New York: IEEE, 2021: 3312-3319.
[91] QI Z, WANG S, SU C, et al. Self-regulated learning for egocentric video activity anticipation[J]. IEEE Trans on Pattern Analysis and Machine Intelligence, 2023, 45(6): 6715-6730.
[92] LIU X, HAO C, YU Z, et al. From recognition to prediction: Leveraging sequence reasoning for action anticipation[J]. ACM Transactions on Multimedia Computing, Communications and Applications, 2024, 20(11): 1-19.
[93] THAKUR S, BEYAN C, MORERIO P, et al. Leveraging next-active objects for context-aware anticipation in egocentric videos [C]// Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. Piscataway, NJ: IEEE Press, 2024: 8657-8666.
[94] TONG Z, SONG Y, WANG J, et al. VideoMAE: Masked autoencoders are data-efficient learners for self-supervised video pre-training[J]. Advances in Neural Information Processing Systems, 2022, 35: 10078-10093.