
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 247-254. doi: 10.19678/j.issn.1000-3428.0064409

• Graphics and Image Processing •

Video Content Caption Generation Based on ViT and Semantic Guidance

ZHAO Hong, CHEN Zhiwen, GUO Lan, AN Dong   

  1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Received: 2022-04-08; Revised: 2022-05-17; Published: 2022-08-09

  • About the authors: ZHAO Hong (born 1971), male, professor, Ph.D.; his main research interests are computer vision, natural language processing, and deep learning. CHEN Zhiwen (corresponding author), GUO Lan, and AN Dong are master's students.
  • Funding:
    National Natural Science Foundation of China, "Research on Broad-Spectrum Malicious Domain Name Detection Methods Based on Deep Learning" (62166025); Key Research and Development Program of Gansu Province, "Understanding of Surveillance Video Content, Generation of Descriptive Text, and Demonstration Applications in Key Industries" (21YF5GA073).

Abstract: This paper proposes a video captioning method based on Vision Transformer (ViT) and semantic guidance to alleviate the poor readability and low accuracy of the caption text generated by existing video content captioning models. First, the visual features of the video are extracted by ResNeXt and the Efficient Convolutional Network (ECO). Second, a Semantic Detection Network (SDN) is trained with the extracted visual features as input and the predicted probabilities of semantic labels as output. Third, the static and dynamic visual features are globally encoded by ViT and fused with the semantic features extracted by the SDN. Finally, the fused features are decoded by a semantic Long Short-Term Memory (LSTM) network to generate the caption text corresponding to the video. Experimental results show that introducing the semantic features of the video guides the model to generate captions that better match human habits and are more readable. Test results on the MSR-VTT dataset show that the BLEU-4, METEOR, ROUGE-L, and CIDEr scores of the model are 44.8, 28.9, 62.8, and 51.1, respectively. Compared with the current mainstream video content captioning models ADL and SBAT, the total score across the four metrics increases by 16.6 and 16.8, respectively.
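To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its main stages: pooled visual features feed a Semantic Detection Network that predicts semantic-tag probabilities; the static (ResNeXt) and dynamic (ECO) features are globally encoded by a Transformer encoder standing in for ViT; and the visual context fused with the semantic features guides an LSTM decoder. All module names, feature dimensions, the simple additive fusion, and the single-layer LSTM are illustrative assumptions, not the authors' exact architecture; in particular, the paper fuses features with attention and uses a semantic LSTM, which this sketch only approximates.

```python
# Minimal sketch of the ViT + semantic-guidance captioning pipeline.
# Dimensions, layer counts, and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class SemanticDetectionNetwork(nn.Module):
    """Maps pooled visual features to multi-label semantic-tag probabilities."""

    def __init__(self, feat_dim=2048, num_tags=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_tags),
        )

    def forward(self, pooled_visual):                    # (B, feat_dim)
        return torch.sigmoid(self.mlp(pooled_visual))    # (B, num_tags)


class CaptionModel(nn.Module):
    """Globally encodes 2D/3D features, fuses them with semantic features,
    and decodes the fused representation into caption tokens with an LSTM."""

    def __init__(self, feat_dim=2048, d_model=512, num_tags=300, vocab=10000):
        super().__init__()
        self.proj2d = nn.Linear(feat_dim, d_model)       # static (ResNeXt) features
        self.proj3d = nn.Linear(feat_dim, d_model)       # dynamic (ECO) features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)  # stand-in for ViT
        self.sdn = SemanticDetectionNetwork(feat_dim, num_tags)
        self.sem_proj = nn.Linear(num_tags, d_model)
        self.embed = nn.Embedding(vocab, d_model)
        self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True)   # semantic-LSTM stand-in
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats2d, feats3d, captions):
        # feats2d/feats3d: (B, T, feat_dim) precomputed features; captions: (B, L) token ids
        tokens = torch.cat([self.proj2d(feats2d), self.proj3d(feats3d)], dim=1)
        enc = self.encoder(tokens)                       # global encoding (B, 2T, d_model)
        sem = self.sem_proj(self.sdn(feats2d.mean(dim=1)))  # semantic features (B, d_model)
        ctx = enc.mean(dim=1)                            # pooled visual context (B, d_model)
        words = self.embed(captions)                     # (B, L, d_model)
        guide = (ctx + sem).unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.lstm(torch.cat([words, guide], dim=-1))
        return self.out(hidden)                          # (B, L, vocab) logits


# Shape check with random tensors.
model = CaptionModel()
logits = model(torch.randn(2, 16, 2048), torch.randn(2, 16, 2048),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)   # torch.Size([2, 12, 10000])
```

In this sketch the semantic probabilities act purely as a global guidance vector added to the visual context at every decoding step; the paper's attention-based fusion and semantic LSTM gating would replace that additive step.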

Key words: video content caption, video understanding, Vision Transformer(ViT) model, semantic guidance, Long Short-Term Memory(LSTM) network, attention mechanism

