
Computer Engineering (计算机工程)



Long-Term Action Prediction Method Based on Collaboration Between Large and Small Models

  • Published: 2025-12-02


Abstract: Long-term action anticipation is a crucial task in computer vision that aims to predict the sequence of actions a person is likely to perform over an extended future horizon, given a first-person (egocentric) video. The main challenge lies in the inherent uncertainty of future behavior: actors in similar contexts may follow multiple plausible action trajectories, while most video samples in existing datasets cover only one of them, limiting the model's ability to learn this diversity. Moreover, the observed video segment is short relative to the long horizon that must be predicted, and the resulting mismatch between limited observations and long-range reasoning further increases the difficulty. To address these challenges, we propose a predictive framework named the Vision and LLM Cooperative Network (ViLLCoNet), which is built on a cooperation mechanism between a small model and a large model and consists of a small-model prediction module and a large-model auxiliary module, responsible for predictive modeling and for constraining the prediction space, respectively. The small model comprises a visual encoder, a visual auxiliary information extractor, and an action predictor: it encodes the input video, extracts visual auxiliary cues, and generates the distribution over future actions. The visual auxiliary information extractor fuses hand cues with object-region features and uses a cross-attention mechanism to model hand-object interactions. The large-model auxiliary module, built on a large language model, identifies object nouns that are unlikely to appear in the current scene and uses them to constrain the small model's predictor; by masking implausible candidates in the prediction space, it improves both the accuracy and the plausibility of the predictions. In addition, the loss function is extended with a noun temporal smoothing loss that constrains the predicted noun distributions to be temporally coherent. The proposed method is evaluated on the Ego4D and 50Salads datasets. Experimental results show that, compared with the baseline model, ViLLCoNet achieves an 8.9% improvement in noun prediction and a 4.2% improvement in verb prediction on the Ego4D dataset.
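To make the described mechanisms concrete, the sketch below illustrates in PyTorch how the three ideas from the abstract could be wired together: cross-attention fusion of hand cues with object-region features, masking of LLM-flagged implausible nouns in the predictor's output space, and a temporal smoothing penalty on the predicted noun distributions. All class names, tensor shapes, and vocabulary sizes here are illustrative assumptions for this sketch, not the paper's implementation.

import torch
import torch.nn as nn


class HandObjectCrossAttention(nn.Module):
    # Fuses per-frame hand cues with object-region features: the hand
    # features act as queries that attend over the object regions.
    def __init__(self, dim=256, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, hand_feats, obj_feats):
        # hand_feats: (B, T, D) hand cues; obj_feats: (B, N, D) object regions
        fused, _ = self.attn(query=hand_feats, key=obj_feats, value=obj_feats)
        return self.norm(hand_feats + fused)  # residual connection + layer norm


class MaskedActionPredictor(nn.Module):
    # Predicts verb/noun logits for each future step and removes noun
    # candidates that the LLM flagged as implausible for the current scene.
    def __init__(self, dim=256, num_verbs=100, num_nouns=300, future_steps=20):
        super().__init__()  # vocabulary sizes are placeholders, not the dataset taxonomy
        self.future_steps = future_steps
        self.verb_head = nn.Linear(dim, num_verbs)
        self.noun_head = nn.Linear(dim, num_nouns)

    def forward(self, context, implausible_noun_mask):
        # context: (B, T, D) fused visual features from the small model
        # implausible_noun_mask: (B, num_nouns), 1 = flagged as implausible by the LLM
        pooled = context.mean(dim=1, keepdim=True).expand(-1, self.future_steps, -1)
        verb_logits = self.verb_head(pooled)   # (B, S, num_verbs)
        noun_logits = self.noun_head(pooled)   # (B, S, num_nouns)
        noun_logits = noun_logits.masked_fill(
            implausible_noun_mask.bool().unsqueeze(1), float("-inf"))
        return verb_logits, noun_logits


def noun_temporal_smoothing_loss(noun_logits):
    # Penalises abrupt changes between consecutive predicted noun
    # distributions, encouraging temporal coherence.
    probs = noun_logits.softmax(dim=-1)        # (B, S, num_nouns)
    return (probs[:, 1:] - probs[:, :-1]).abs().mean()


# Toy usage with random tensors standing in for real video features.
fuser = HandObjectCrossAttention()
predictor = MaskedActionPredictor()
hand = torch.randn(2, 8, 256)                  # (B, T, D)
objects = torch.randn(2, 5, 256)               # (B, N, D)
llm_mask = torch.zeros(2, 300)
llm_mask[:, :50] = 1                           # pretend the LLM flagged 50 nouns
verb_logits, noun_logits = predictor(fuser(hand, objects), llm_mask)
smooth_loss = noun_temporal_smoothing_loss(noun_logits)

In a full training setup the smoothing term would be added to the usual classification losses on verbs and nouns; it is shown in isolation here only to make the temporal-coherence constraint explicit.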