
Computer Engineering


Video Text Semantic Alignment and Full Video Dependency for Weakly Supervised Action Localization


  • Published: 2025-07-22


Abstract: To address the challenges in existing weakly supervised temporal action localization research, such as the underutilization of the temporal characteristics of actions, their global properties, and action semantic consistency, a method based on video-text semantic alignment and full-video dependency (FVD-ALM) is proposed, which exploits multi-source information to improve the accuracy and robustness of action localization. First, dilated convolutions expand the model's receptive field, and attention mechanisms precisely enhance the temporal features of action instances, ensuring accurate temporal feature extraction and capturing the dynamic changes of actions. Then, an expectation-maximization algorithm based on a Gaussian mixture model extracts and enhances global information from the video, generating accurate temporal class activation maps that aid the localization process. Finally, a video-text semantic alignment module is designed to understand actions comprehensively by combining the textual information in action labels; the model is trained to complete textual descriptions of actions, strengthening its awareness of action-category consistency and enabling it to effectively distinguish different action categories. Experimental results on the THUMOS14 and ActivityNet1.3 datasets confirm the effectiveness of the method: it achieves an average mAP of 39.1% on THUMOS14, a 2.0-percentage-point improvement over the DTRP-Loc method. This demonstrates that integrating multi-source information significantly improves the accuracy of action localization and provides an effective solution for weakly supervised action localization tasks.
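The abstract does not give implementation details, but the first step, widening the temporal receptive field with dilated convolutions and re-weighting snippet features with attention, can be illustrated with a minimal PyTorch-style sketch. The module names, dilation rates, and feature dimensions below are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn

class TemporalEnhancer(nn.Module):
    """Illustrative sketch: dilated temporal convolutions to enlarge the
    receptive field, followed by snippet-level attention re-weighting.
    Dilation rates and dimensions are assumptions, not from the paper."""
    def __init__(self, dim=2048, dilations=(1, 2, 4)):
        super().__init__()
        # Stacked 1-D convolutions over the time axis with growing dilation.
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, dilation=d, padding=d)
            for d in dilations
        )
        # A lightweight attention branch producing one weight per snippet.
        self.attn = nn.Sequential(
            nn.Conv1d(dim, 256, kernel_size=1),
            nn.ReLU(),
            nn.Conv1d(256, 1, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x):                   # x: (batch, dim, T) snippet features
        for conv in self.convs:
            x = torch.relu(conv(x)) + x     # residual dilated convolution
        a = self.attn(x)                    # (batch, 1, T) attention weights
        return x * a, a                     # enhanced features and weights


# Usage: enhance I3D-style snippet features of a 750-snippet untrimmed video.
feats = torch.randn(2, 2048, 750)
enhanced, attn = TemporalEnhancer()(feats)
print(enhanced.shape, attn.shape)           # (2, 2048, 750) and (2, 1, 750)
```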
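The second step fits a Gaussian mixture over the snippet features of a whole video with expectation-maximization so that video-level (global) structure can support the temporal class activation maps. The sketch below shows a generic soft-EM pass over snippet features; the component count, the use of scikit-learn's GaussianMixture, and the responsibility-weighted aggregation are assumptions made for illustration only.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def global_context(snippet_feats, n_components=4):
    """Illustrative sketch: fit a GMM with EM over one video's snippets and
    build a global descriptor from responsibility-weighted component means.
    Component count and aggregation are assumptions, not from the paper."""
    # snippet_feats: (T, D) features of one untrimmed video
    gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50, random_state=0)
    gmm.fit(snippet_feats)
    resp = gmm.predict_proba(snippet_feats)           # (T, K) E-step posteriors
    # Responsibility-weighted pooling: one D-dim summary per mixture component.
    weights = resp / (resp.sum(axis=0, keepdims=True) + 1e-8)
    centers = weights.T @ snippet_feats               # (K, D) global summaries
    # Broadcast the global summaries back to every snippet as extra context.
    context = resp @ centers                          # (T, D)
    return context

feats = np.random.randn(750, 2048).astype(np.float32)
ctx = global_context(feats)
print(ctx.shape)                                      # (750, 2048)
```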
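The last step aligns video features with text representations of the action labels, training the model to complete a textual description of the action (for example, a prompt built from the class name). The CLIP-style similarity loss below is only one plausible realization of such an alignment objective; the temperature, the cosine-similarity formulation, and the prompt idea are assumptions, not the paper's stated design.

```python
import torch
import torch.nn.functional as F

def alignment_loss(video_emb, text_emb, labels, temperature=0.07):
    """Illustrative sketch of a video-text semantic alignment objective:
    each pooled video embedding is pulled toward the text embedding of its
    action label and pushed away from the other labels. Temperature and the
    cosine-similarity formulation are assumptions, not from the paper."""
    v = F.normalize(video_emb, dim=-1)        # (B, D) pooled video embeddings
    t = F.normalize(text_emb, dim=-1)         # (C, D) one embedding per class
    logits = v @ t.T / temperature            # (B, C) similarity scores
    return F.cross_entropy(logits, labels)    # align each video with its label text

# Usage with random stand-ins for encoder outputs (20 classes, 512-d space).
video_emb = torch.randn(8, 512)
text_emb = torch.randn(20, 512)               # e.g. embeddings of label prompts
labels = torch.randint(0, 20, (8,))
print(alignment_loss(video_emb, text_emb, labels).item())
```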
