
Computer Engineering ›› 2026, Vol. 52 ›› Issue (6): 31-52. doi: 10.19678/j.issn.1000-3428.0252357

• Frontier Perspectives and Reviews •

Review of Deep Learning Methods for Short-Term Action Anticipation

SUN Haifeng1, YAO Junping1, LI Xiaojun1,*, LIU Yanfei2, GU Hongyang1

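  1. School of Operational Support, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China
    2. Department of Basic, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China
  • Received:2025-04-22 Revised:2025-07-20 Online:2026-06-15 Published:2025-08-21
  • Corresponding author: LI Xiaojun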
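  • About the authors:

    SUN Haifeng, male, Ph.D. candidate; main research interests: action recognition and human-object interaction detection

    YAO Junping, professor, Ph.D.

    LI Xiaojun (corresponding author), associate professor, Ph.D.

    LIU Yanfei, professor, Ph.D.

    GU Hongyang, lecturer, Ph.D.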

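  • Supported by:
    National Natural Science Foundation of China (62401609); China Postdoctoral Science Foundation (2024M754275); Natural Science Basic Research Program of Shaanxi Province (2025JC-YBMS-783)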

Review of Deep Learning Methods for Short-Term Action Anticipation

SUN Haifeng1, YAO Junping1, LI Xiaojun1,*, LIU Yanfei2, GU Hongyang1

  1. School of Operational Support, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China
    2. Department of Basic, Rocket Force University of Engineering, Xi'an 710025, Shaanxi, China
  • Received:2025-04-22 Revised:2025-07-20 Online:2026-06-15 Published:2025-08-21
  • Contact: LI Xiaojun

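Abstract:

Short-term action anticipation is an important task in video understanding. By modeling the spatiotemporal and semantic features of past actions, it turns observed physical motions into inferences about action intentions and goals and predicts the interactions that will occur within the next few seconds, with broad application prospects in human-machine collaboration, security surveillance, autonomous driving, and augmented reality. In recent years, innovations in feature extraction models and the construction of high-quality datasets have jointly advanced video understanding and moved short-term action anticipation from a knowledge-driven machine learning paradigm to a data-driven deep learning paradigm. This paper systematically reviews the latest deep learning techniques in this field, with the aim of providing a reference for related research and for the analysis of application scenarios. First, a taxonomy is built along three dimensions (model architecture innovation, training strategy application, and context modeling methods); key technologies and challenges in the field are analyzed, and the characteristics, applicable scenarios, and research progress of each category of methods are described. Then, the datasets commonly used for the task are briefly summarized, and the performance of various methods on mainstream datasets is compared. Finally, current challenges are identified and possible future research directions are discussed, including multi-view collaborative prediction, real-time model inference verification, weakly supervised learning from untrimmed data, few-shot class-incremental generalization, adaptation to dynamic open scenes, and prediction over variable time intervals.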

Key words: video understanding, short-term action anticipation, semantic action, deep learning, training strategy

Abstract:

Short-term action anticipation, a crucial task in video understanding, transforms observed physical motions into inferences about action intentions and goals by modeling the spatiotemporal and semantic features of historical actions. It aims to predict interactive behaviors within the next few seconds and has broad application prospects in human-machine collaboration, security surveillance, autonomous driving, and augmented reality. Recent innovations in feature extraction models and the construction of high-quality datasets have together propelled the development of video understanding. This progress has shifted short-term action anticipation from a knowledge-driven machine learning paradigm to a data-driven deep learning paradigm. This survey systematically reviews the latest advances in deep learning methods for short-term action anticipation, providing references and insights for related research and practical application analysis. A classification framework is first constructed from three perspectives: model architecture innovation, training strategy application, and contextual modeling methods. Within this framework, key technologies and challenges in the field are analyzed, and the characteristics, applicable scenarios, and research progress of each method category are elaborated. Next, datasets commonly used for this task are summarized, and the performance of various methods on mainstream datasets is compared. Finally, the current challenges and future research directions are outlined, including multi-view collaborative prediction, real-time model inference verification, weakly supervised learning from untrimmed data, few-shot class-incremental generalization, dynamic open-scene adaptation, and variable time interval prediction.

Key words: video understanding, short-term action anticipation, semantic action, deep learning, training strategy