[1]Li J, Chen D, Hong Y, et al. CoVLM: Composing visual entities and relationships in large language models via communicative decoding[J]. arXiv preprint arXiv:2311.03354, 2023.
[2]Zhong Z, Schneider D, Voit M, et al. Anticipative feature fusion transformer for multi-modal action anticipation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023: 6068-6077.
[3]张天予,闵巍庆,韩鑫阳,等.视频中的未来动作预测研究综述[J].计算机学报, 2023, 46(6):1315-1338.
Zhang Tianyu, Min Weiqing, Han Xinyang, et al. A Survey on Future Action Anticipation in Video[J]. Chinese Journal of Computers, 2023, 46(6): 1315-1338. (in Chinese)
[4]Grauman K, Westbury A, Byrne E, et al. Ego4D: Around the world in 3,000 hours of egocentric video[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 18995-19012.
[5]Zhang C, Gupta A, Zisserman A. Helping hands: An object-aware ego-centric video recognition model[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 13901-13912.
[6]Zhang C, Fu C, Wang S, et al. Object-centric video representation for long-term action anticipation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024: 6751-6761.
[7]Qi Z, Wang S, Zhang W, et al. Uncertainty-boosted robust video activity anticipation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024, 46(12): 7775-7792.
[8]Qiu Y, Rajan D. Action sequence augmentation for action anticipation[C]//The Thirteenth International Conference on Learning Representations. 2025.
[9]Das S, Ryoo M S. Video + CLIP baseline for Ego4D long-term action anticipation[J]. arXiv preprint arXiv:2207.00579, 2022.
[10]Bao W, Yu Q, Kong Y. Uncertainty-based traffic accident anticipation with spatio-temporal relational learning[C]//Proceedings of the 28th ACM International Conference on Multimedia. 2020: 2682-2690.
[11]Zhao Q, Wang S, Zhang C, et al. AntGPT: Can large language models help long-term action anticipation from videos?[J]. arXiv preprint arXiv:2307.16368, 2023.
[12]Huang D, Hilliges O, Van Gool L, et al. PALM: Predicting actions through language models @ Ego4D long-term action anticipation challenge 2023[J]. arXiv preprint arXiv:2306.16545, 2023.
[13]Mittal H, Agarwal N, Lo S Y, et al. Can't make an omelette without breaking some eggs: Plausible action anticipation using large video-language models[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 18580-18590.
[14]Zhong Z, Martin M, Voit M, et al. A survey on deep learning techniques for action anticipation[J]. arXiv preprint arXiv:2309.17257, 2023.
[15]Qi Z, Wang S, Zhang W, et al. Uncertainty-aware mixture of experts for video action anticipation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[16]Qi Z, Wang S, Su C, et al. Self-regulated learning for egocentric video activity anticipation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(6): 6715-6730.
[17]胡佛,沈浩铭,杨旭升,等.基于多分支混合图时空卷积网络的骨架运动预测[C]//2023中国自动化大会. 2023.
Hu Fo, Shen Haoming, Yang Xusheng, et al. Skeleton Motion Prediction Based on Multi-Branch Hybrid Graph Spatiotemporal Convolutional Network[C]//Proceedings of the 2023 China Automation Congress. 2023. (in Chinese)
[18]黄金贵,黄一举.基于注意力时空解耦3D卷积LSTM的视频预测[J].微电子学与计算机, 2022(009):039.
Huang Jingui, Huang Yiju. Video Prediction Based on Attention Spatiotemporal Decoupled 3D Convolutional LSTM[J]. Microelectronics & Computer, 2022(009): 039. (in Chinese)
[19]Nawhal M, Jyothi A A, Mori G. Rethinking learning approaches for long-term action anticipation[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 558-576.
[20]Pei B, Chen G, Xu J, et al. EgoVideo: Exploring egocentric foundation model and downstream adaptation[J]. arXiv preprint arXiv:2406.18070, 2024.
[21]Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International conference on machine learning. PMLR, 2021: 8748-8763.
[22]Mascaró E V, Ahn H, Lee D. Intention-conditioned long-term human egocentric action anticipation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023: 6048-6057.
[23]Zach C, Pock T, Bischof H. A duality based approach for realtime TV-L1 optical flow[C]//Joint Pattern Recognition Symposium. Berlin, Heidelberg: Springer Berlin Heidelberg, 2007: 214-223.
[24]Wu C Y, Krahenbuhl P. Towards long-form video understanding[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 1884-1894.
[25]Roy D, Fernando B. Action anticipation using pairwise human-object interactions and transformers[J]. IEEE Transactions on Image Processing, 2021, 30: 8116-8129.
[26]Qi S, Wang W, Jia B, et al. Learning human-object interactions by graph parsing neural networks[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 401-417.
[27]Roy D, Rajendiran R, Fernando B. Interaction region visual transformer for egocentric action anticipation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2024: 6740-6750.
[28]Liang J, Jiang L, Niebles J C, et al. Peeking into the future: Predicting future person activities and locations in videos[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2019: 5725-5734.
[29]Tan K, Qi Z, Zhong J, et al. KN-VLM: KNowledge-guided Vision-and-Language Model for visual abductive reasoning[J]. Multimedia Systems, 2025, 31(2): 146.
[30]Liu M, Tang S, Li Y, et al. Forecasting human-object interaction: joint prediction of motor attention and actions in first person video[C]//European conference on computer vision. Cham: Springer International Publishing, 2020: 704-721.
[31]Li Y, Liu M, Rehg J M. In the eye of beholder: Joint learning of gaze and actions in first person video[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 619-635.
[32]Huang Y, Chen G, Xu J, et al. EgoExoLearn: A dataset for bridging asynchronous ego- and exo-centric view of procedural activities in real world[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024: 22072-22086.
[33]Nagarajan T, Li Y, Feichtenhofer C, et al. Ego-topo: Environment affordances from egocentric video[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 163-172.
[34]Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J]. Advances in neural information processing systems, 2017, 30.
[35]Xiao F, Kundu K, Tighe J, et al. Hierarchical self-supervised representation learning for movie understanding[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 9727-9736.
[36]Chen C, Qin R, Luo F, et al. Position-enhanced visual instruction tuning for multimodal large language models[J]. arXiv preprint arXiv:2308.13437, 2023.
[37]Zhang H, Ee Y K, Fernando B. RCA: Region conditioned adaptation for visual abductive reasoning[C]//Proceedings of the 32nd ACM International Conference on Multimedia. 2024: 9455-9464.
[38]Shen Y, Ni B, Li Z, et al. Egocentric activity prediction via event modulated attention[C]//Proceedings of the European conference on computer vision (ECCV). 2018: 197-212.
[39]Herzig R, Ben-Avraham E, Mangalam K, et al. Object-region video transformers[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3148-3159.
[40]Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding?[C]//International Conference on Machine Learning (ICML). 2021.
[41]Yang L, Zhang S, Yu Z, et al. Supervised knowledge makes large language models better in-context learners[C]//The Twelfth International Conference on Learning Representations. 2024.
[42]Feichtenhofer C, Fan H, Malik J, et al. SlowFast networks for video recognition[C]//Proceedings of the IEEE/CVF international conference on computer vision. 2019: 6202-6211.
[43]Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). 2019: 4171-4186.
[44]Stein S, McKenna S J. Combining embedded accelerometers with computer vision for recognizing food preparation activities[C]//Proceedings of the 2013 ACM international joint conference on Pervasive and ubiquitous computing. 2013: 729-738.
[45]Abu Farha Y, Richard A, Gall J. When will you do what? - Anticipating temporal occurrences of activities[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2018: 5343-5352.
[46]Guo D, Yang D, Zhang H, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[J]. arXiv preprint arXiv:2501.12948, 2025.
[47]Gong D, Lee J, Kim M, et al. Future transformer for long-term action anticipation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3052-3061.
[48]Ashutosh K, Girdhar R, Torresani L, et al. HierVL: Learning hierarchical video-language embeddings[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023: 23066-23078.