
Computer Engineering, 2022, Vol. 48, Issue (10): 45-54. doi: 10.19678/j.issn.1000-3428.0063294

• Hot Topics and Reviews •


Research on Text Representation of Video Content Based on Multi-Modal Fusion and Multi-Layer Attention

ZHAO Hong, GUO Lan, CHEN Zhiwen, ZHENG Houze   

  1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Received: 2021-11-19  Revised: 2021-12-27  Published: 2021-12-30
  • About the authors: ZHAO Hong (born 1971), male, professor, Ph.D.; his main research interests are computer vision, natural language processing, and deep learning. GUO Lan (corresponding author), CHEN Zhiwen, and ZHENG Houze are master's students.
  • Funding: National Natural Science Foundation of China (62166025, 51668043); Key Research and Development Program of Gansu Province (21YF5GA073).


Abstract: To address the problems that existing video content text-representation models generate monotonous descriptions with limited accuracy, a model that fuses frame-level image and audio information is proposed. A single-modal embedding layer is designed on the basis of a self-attention mechanism and applied to each single-modal feature to learn its parameters. Two schemes, joint representation and cooperative representation, are adopted to fuse the high-dimensional feature vectors output by the single-modal embedding layers, enabling the model to attend to the different objects in a video and the interactions between them, and thereby generate richer and more accurate text descriptions. The model is pretrained on large-scale datasets; representation information extracted from a video, such as its frames and the audio it carries, is then fed into an encoder-decoder to produce the text representation of the video content. Experimental results on the MSR-VTT and LSMDC datasets show that the proposed model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 0.386, 0.250, 0.609, and 0.463, respectively, improvements of 0.082, 0.037, 0.115, and 0.257 over the model released by IIT Delhi in the MSR-VTT challenge. The model thus effectively improves the accuracy of video content text representation.

Key words: text description of video content, multi-modal fusion, joint representation, collaborative representation, self-attention mechanism
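
To make the fusion step described in the abstract concrete, below is a minimal PyTorch sketch of its three ingredients: a self-attention single-modal embedding layer, joint representation (concatenating both modalities into one shared space), and cooperative representation (separate projections kept aligned by a similarity constraint). This is an illustration only, not the authors' released implementation; all module names, dimensions, and the choice of alignment loss are assumptions.

# Sketch of the single-modal self-attention embedding and the two bimodal
# fusion schemes (joint vs. cooperative). Illustrative assumptions throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalEmbedding(nn.Module):
    """Single-modal embedding layer: self-attention over a feature sequence."""

    def __init__(self, feat_dim: int, model_dim: int = 512, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) -> (batch, seq_len, model_dim)
        h = self.proj(x)
        out, _ = self.attn(h, h, h)    # self-attention: Q = K = V = h
        return self.norm(h + out)      # residual connection + layer norm


class JointFusion(nn.Module):
    """Joint representation: concatenate modalities, map to one shared space."""

    def __init__(self, model_dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(2 * model_dim, model_dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # v, a: (batch, seq_len, model_dim); sequences assumed time-aligned
        return torch.tanh(self.fuse(torch.cat([v, a], dim=-1)))


class CooperativeFusion(nn.Module):
    """Cooperative (coordinated) representation: per-modality projections
    whose pooled embeddings are pulled together by a cosine alignment loss."""

    def __init__(self, model_dim: int = 512):
        super().__init__()
        self.v_proj = nn.Linear(model_dim, model_dim)
        self.a_proj = nn.Linear(model_dim, model_dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        zv, za = self.v_proj(v), self.a_proj(a)
        align_loss = 1.0 - F.cosine_similarity(
            zv.mean(dim=1), za.mean(dim=1), dim=-1).mean()
        return zv + za, align_loss


if __name__ == "__main__":
    frames = torch.randn(2, 20, 2048)  # e.g. 20 frame-level CNN features
    audio = torch.randn(2, 20, 128)    # e.g. 20 audio feature vectors
    v = ModalEmbedding(2048)(frames)
    a = ModalEmbedding(128)(audio)
    fused_joint = JointFusion()(v, a)                 # (2, 20, 512)
    fused_coop, loss = CooperativeFusion()(v, a)      # (2, 20, 512), scalar
    print(fused_joint.shape, fused_coop.shape, loss.item())

Joint representation yields one fused sequence that can be fed directly to a downstream encoder-decoder, while the cooperative scheme keeps per-modality spaces and supervises their agreement; the abstract reports adopting both schemes.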
