
Computer Engineering, 2024, Vol. 50, Issue (2): 266-272. doi: 10.19678/j.issn.1000-3428.0067206

• Graphics and Image Processing •

Video Description Generation Method Based on Latent Feature Augmented Network

Weijian LI*, Huijun HU

  1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, Hubei, China
  • Received: 2023-03-20; Online: 2024-02-15; Published: 2023-06-08
  • Contact: Weijian LI
  • Supported by: National Natural Science Foundation of China (62271359)

Abstract:

Video description generation aims to describe the objects in a video and their interactions in natural language. Existing methods do not fully exploit the spatio-temporal semantic information in videos, which limits the model's ability to generate accurate description sentences. To this end, a Latent Feature Augmented Network (LFAN) model is proposed for video description generation. Different feature extractors are used to extract appearance, motion, and object features; the object-level features are fused with the frame-level appearance and motion features, and the different fused features are then enhanced. Before generating descriptions, a graph neural network and a Long Short-Term Memory (LSTM) network are used to reason about the spatio-temporal relationships between objects, thereby obtaining latent features that carry both spatio-temporal and semantic information. Finally, a decoder that combines an LSTM and a Gated Recurrent Unit (GRU) generates the description sentence for the video. This network model can accurately learn object features and thus guide the generation of more accurate words and object relationships. Experimental results on the MSVD and MSR-VTT datasets show that the LFAN model significantly improves the accuracy of the generated description sentences and exhibits better semantic consistency with the video content. The BLEU@4 and ROUGE-L scores are 57.0 and 74.1 on the MSVD dataset, and 43.8 and 62.1 on the MSR-VTT dataset, respectively.

Key words: video description generation, latent feature augmented network, spatio-temporal semantic information, graph neural networks, feature fusion
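To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of such a latent feature augmented encoder-decoder. It is an illustration only, not the authors' implementation: the layer sizes, the gated feature enhancement, the mean-aggregated object graph, and the teacher-forced LSTM/GRU decoder are assumptions filling in details the abstract does not specify.

```python
# Minimal, illustrative sketch of an LFAN-style encoder-decoder (not the paper's code).
import torch
import torch.nn as nn


class LatentFeatureAugmentedNet(nn.Module):
    """Fuses object-level features with frame-level appearance/motion features,
    reasons over objects with a simple graph layer plus an LSTM, and decodes
    a sentence with an LSTM/GRU decoder (all specifics are assumptions)."""

    def __init__(self, feat_dim=512, hidden=512, vocab_size=10000):
        super().__init__()
        # Project pre-extracted appearance (2D-CNN), motion (3D-CNN) and
        # object (detector) features into a shared space.
        self.app_proj = nn.Linear(feat_dim, hidden)
        self.mot_proj = nn.Linear(feat_dim, hidden)
        self.obj_proj = nn.Linear(feat_dim, hidden)
        # Fuse object features with frame-level context, then enhance with a gate.
        self.fuse = nn.Linear(hidden * 2, hidden)
        self.gate = nn.Sequential(nn.Linear(hidden, hidden), nn.Sigmoid())
        # One round of message passing over the objects of each frame (assumed GNN form).
        self.gnn = nn.Linear(hidden, hidden)
        # Temporal reasoning over frames.
        self.temporal_lstm = nn.LSTM(hidden, hidden, batch_first=True)
        # Decoder: word embedding, an LSTM cell and a GRU cell, vocabulary projection.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.dec_lstm = nn.LSTMCell(hidden * 2, hidden)
        self.dec_gru = nn.GRUCell(hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def encode(self, app, mot, obj):
        # app, mot: (B, T, D) frame-level features; obj: (B, T, N, D) object features.
        app, mot, obj = self.app_proj(app), self.mot_proj(mot), self.obj_proj(obj)
        # Fuse each object with its frame's appearance and motion context.
        ctx = (app + mot).unsqueeze(2).expand_as(obj)              # (B, T, N, H)
        fused = torch.tanh(self.fuse(torch.cat([obj, ctx], -1)))   # (B, T, N, H)
        fused = fused + self.gate(fused) * fused                   # feature enhancement
        # Mean-aggregated message passing over the objects in each frame.
        msg = self.gnn(fused.mean(dim=2, keepdim=True))            # (B, T, 1, H)
        objects = torch.relu(fused + msg)                          # (B, T, N, H)
        frame_feat = objects.mean(dim=2)                           # (B, T, H)
        latent, _ = self.temporal_lstm(frame_feat)                 # spatio-temporal latent features
        return latent.mean(dim=1)                                  # (B, H) video-level context

    def forward(self, app, mot, obj, captions):
        # captions: (B, L) token ids; teacher forcing for brevity.
        ctx = self.encode(app, mot, obj)
        B, H = ctx.size()
        h1 = c1 = h2 = torch.zeros(B, H, device=ctx.device)
        logits = []
        for t in range(captions.size(1)):
            w = self.embed(captions[:, t])
            h1, c1 = self.dec_lstm(torch.cat([w, ctx], -1), (h1, c1))
            h2 = self.dec_gru(h1, h2)
            logits.append(self.out(h2))
        return torch.stack(logits, dim=1)                          # (B, L, vocab)


# Toy usage with random tensors standing in for pre-extracted features.
model = LatentFeatureAugmentedNet()
app = torch.randn(2, 8, 512)         # 2 videos, 8 frames, appearance features
mot = torch.randn(2, 8, 512)         # motion features
obj = torch.randn(2, 8, 5, 512)      # 5 detected objects per frame
caps = torch.randint(0, 10000, (2, 12))
print(model(app, mot, obj, caps).shape)  # torch.Size([2, 12, 10000])
```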