Multimodal Video Captioning Approach Guided by Global and Local Concepts

doi:10.19678/j.issn.1000-3428.0252367

Abstract

Abstract: Video captioning aims to analyze video content in depth and describe it accurately and fluently in natural language. Concepts, which correspond to the objects, actions, and attributes in a video, can serve as a medium for captioning. Although some studies have explored concept-guided video captioning, two main problems remain: limited concept detection accuracy and insufficient use of the detected concepts. To address these problems, this paper proposes a multimodal video captioning approach guided by global and local concepts (CGMVC) to improve the quality of the generated descriptions. First, multimodal features of the video are extracted with different backbone networks, and the HMMC model supplies textual information for the video through hierarchical-matching video-to-text retrieval. Multimodal feature fusion and a concept detection network are then used to detect concepts precisely. To make full use of the detected concepts, a concept projection module uncovers the latent topics of the video to guide decoding at the global level, while a semantic attention module and a cross-attention module exploit the concepts and the multimodal video features, respectively, to optimize decoding at the local level. By fully exploiting the concepts and the information from different modalities, the method generates more natural and accurate descriptions. On the MSVD and MSR-VTT datasets, CGMVC achieves CIDEr scores of 111.2% and 64.1% and BLEU@4 scores of 57.1% and 51.2%, respectively. Comparison and ablation experiments demonstrate the superiority of CGMVC over the baseline and other state-of-the-art methods.
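The global and local use of detected concepts can be illustrated with a minimal sketch. It is not the CGMVC implementation: the abstract does not specify the modules' internals, so the sketch assumes a multi-label detector with independent sigmoid outputs, a global topic vector computed as a probability-weighted mean of concept embeddings, and local guidance computed as scaled dot-product attention of a decoder state over the same embeddings. All function names, the toy vocabulary, the logits, and the embeddings are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def detect_concepts(logits, vocab, k):
    """Assumed detector head: independent sigmoid per concept, keep the top-k."""
    probs = [sigmoid(z) for z in logits]
    ranked = sorted(zip(vocab, probs), key=lambda p: p[1], reverse=True)
    return ranked[:k]

def global_topic(concepts, embeddings):
    """Assumed global guidance: probability-weighted mean of concept embeddings."""
    total = sum(p for _, p in concepts)
    dim = len(embeddings[concepts[0][0]])
    topic = [0.0] * dim
    for word, p in concepts:
        for i, v in enumerate(embeddings[word]):
            topic[i] += (p / total) * v
    return topic

def semantic_attention(query, concepts, embeddings):
    """Assumed local guidance: scaled dot-product attention over concept embeddings."""
    keys = [embeddings[w] for w, _ in concepts]
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    context = [sum(w * k[i] for w, k in zip(weights, keys))
               for i in range(len(query))]
    return weights, context

# Toy data (hypothetical): four candidate concepts with 3-dimensional embeddings.
vocab = ["man", "guitar", "play", "dog"]
logits = [2.0, 1.5, 1.0, -3.0]
embeddings = {
    "man": [1.0, 0.0, 0.0],
    "guitar": [0.0, 1.0, 0.0],
    "play": [0.0, 0.0, 1.0],
    "dog": [1.0, 1.0, 1.0],
}

concepts = detect_concepts(logits, vocab, k=3)
topic = global_topic(concepts, embeddings)             # one vector for the whole caption
decoder_state = [0.0, 2.0, 0.0]                        # stand-in for a decoder hidden state
weights, context = semantic_attention(decoder_state, concepts, embeddings)
```

In this sketch the topic vector is computed once per video, whereas the attention weights are recomputed for every decoding step, which mirrors the global versus local distinction drawn in the abstract.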

KONG Yulong, LIN Suzhen, JIN Zanxia. Multimodal Video Captioning Approach Guided by Global and Local Concepts[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252367.

