Computer Engineering (计算机工程)

Multimodal Video Captioning Approach Guided by Global and Local Concepts

  • Published: 2025-09-02

Abstract: Video captioning aims to analyze video content in depth and describe it accurately and fluently in natural language. Concepts, which correspond to the objects, actions, and attributes in a video, can serve as a medium for captioning. Although some studies have explored concept-guided video captioning, two main problems remain: limited concept detection accuracy and insufficient use of the detected concepts. To address these problems, this paper proposes a multimodal video captioning approach guided by global and local concepts (CGMVC) to improve the quality of the generated descriptions. First, different backbone networks extract multimodal features from the video, and the HMMC model supplies textual information about the video through hierarchical-matching video-to-text retrieval. A multimodal feature fusion and concept detection network then detects concepts precisely. To make full use of the detected concepts, a concept projection module uncovers the latent topics of the video and guides decoding at the global level, while a semantic attention module and a cross-attention module exploit the concepts and the multimodal video features, respectively, to refine decoding at the local level. By fully using both the concepts and the information from different modalities, the model generates more natural and accurate descriptions. On the MSVD and MSR-VTT datasets, the CGMVC model achieves CIDEr scores of 111.2% and 64.1% and BLEU@4 scores of 57.1% and 51.2%, respectively. Comparative and ablation experiments demonstrate the superiority of CGMVC over the baseline and other state-of-the-art methods.
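The abstract does not specify how the semantic attention module is computed, so the following is only a minimal sketch of one common way a decoder state could attend over detected concepts. It is not the CGMVC implementation. The function names, the 0.5 detection threshold, the scaled dot-product scoring, and the log-probability bias are all assumptions introduced for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_concepts(probs, threshold=0.5):
    """Indices of concepts whose detection probability reaches the threshold.

    The threshold value is an assumption, not taken from the paper.
    """
    return [i for i, p in enumerate(probs) if p >= threshold]


def semantic_attention(decoder_state, concept_embeddings, concept_probs,
                       threshold=0.5):
    """Attend over detected concepts (hypothetical formulation).

    Returns a dict mapping kept concept index -> attention weight, and the
    context vector (the weighted sum of the kept concept embeddings).
    """
    kept = select_concepts(concept_probs, threshold)
    if not kept:
        # No concept was detected confidently: return a zero context vector.
        return {}, [0.0] * len(decoder_state)

    scale = math.sqrt(len(decoder_state))
    # Scaled dot-product score plus log-probability, so that concepts
    # detected with higher confidence receive more attention (assumption).
    scores = [
        dot(decoder_state, concept_embeddings[i]) / scale
        + math.log(max(concept_probs[i], 1e-12))
        for i in kept
    ]
    weights = softmax(scores)
    context = [
        sum(w * concept_embeddings[i][d] for w, i in zip(weights, kept))
        for d in range(len(decoder_state))
    ]
    return dict(zip(kept, weights)), context


# Toy usage with made-up concepts, probabilities, and 2-d embeddings.
concepts = ["dog", "run", "grass", "car"]
probs = [0.9, 0.8, 0.6, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
state = [1.0, 0.0]
weights, context = semantic_attention(state, embeddings, probs)
for index, weight in weights.items():
    print(concepts[index], round(weight, 3))
```

In this toy setup the low-confidence concept ("car") is filtered out before attention, and the remaining weights sum to one. A global guidance signal such as the one the abstract attributes to the concept projection module would be computed separately and is not modeled here.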