
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 247-254. doi: 10.19678/j.issn.1000-3428.0064409

• Graphics and Image Processing •

Video Content Caption Generation Based on ViT and Semantic Guidance

ZHAO Hong, CHEN Zhiwen, GUO Lan, AN Dong   

  1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Received: 2022-04-08; Revised: 2022-05-17; Published: 2022-08-09

  • About the authors: ZHAO Hong (born 1971), male, professor, Ph.D.; his main research interests are computer vision, natural language processing, and deep learning. CHEN Zhiwen (corresponding author), GUO Lan, and AN Dong are master's students.
  • Funding:
    National Natural Science Foundation of China, "Research on Broad-Spectrum Malicious Domain Name Detection Methods Based on Deep Learning" (62166025); Key Research and Development Program of Gansu Province, "Understanding of Surveillance Video Content, Generation of Descriptive Text, and Demonstration Applications in Key Industries" (21YF5GA073).

Abstract: This paper proposes a video captioning method based on Vision Transformer (ViT) and semantic guidance to alleviate the poor readability and low accuracy of the caption text generated by existing video content captioning models. First, the visual features of the video are extracted by ResNeXt and the Efficient Convolutional Network (ECO). Second, a Semantic Detection Network (SDN) is trained with the extracted visual features as input and the predicted probabilities of semantic labels as output. Third, the static and dynamic visual features are globally encoded by ViT and fused with the semantic features extracted by the SDN. Finally, the fused features are decoded by a semantic Long Short-Term Memory (LSTM) network to generate the caption text corresponding to the video. Experimental results show that introducing the semantic features of the video guides the model to generate captions that better match human habits and are more readable. Test results on the MSR-VTT dataset show that the BLEU-4, METEOR, ROUGE-L, and CIDEr scores of the model are 44.8, 28.9, 62.8, and 51.1, respectively. Compared with the current mainstream video content captioning models ADL and SBAT, the total score across the four metrics increases by 16.6 and 16.8, respectively.
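To make the pipeline described in the abstract concrete, below is a minimal PyTorch sketch of its main stages: pooled visual features feed a Semantic Detection Network that predicts semantic-tag probabilities; the static (ResNeXt) and dynamic (ECO) features are globally encoded by a Transformer encoder standing in for ViT; and the visual context fused with the semantic features guides an LSTM decoder. All module names, feature dimensions, the simple additive fusion, and the single-layer LSTM are illustrative assumptions, not the authors' exact architecture; in particular, the paper fuses features with attention and uses a semantic LSTM, which this sketch only approximates.

```python
# Minimal sketch of the ViT + semantic-guidance captioning pipeline.
# Dimensions, layer counts, and the fusion scheme are assumptions for illustration.
import torch
import torch.nn as nn


class SemanticDetectionNetwork(nn.Module):
    """Maps pooled visual features to multi-label semantic-tag probabilities."""

    def __init__(self, feat_dim=2048, num_tags=300):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 512), nn.ReLU(),
            nn.Linear(512, num_tags),
        )

    def forward(self, pooled_visual):                    # (B, feat_dim)
        return torch.sigmoid(self.mlp(pooled_visual))    # (B, num_tags)


class CaptionModel(nn.Module):
    """Globally encodes 2D/3D features, fuses them with semantic features,
    and decodes the fused representation into caption tokens with an LSTM."""

    def __init__(self, feat_dim=2048, d_model=512, num_tags=300, vocab=10000):
        super().__init__()
        self.proj2d = nn.Linear(feat_dim, d_model)       # static (ResNeXt) features
        self.proj3d = nn.Linear(feat_dim, d_model)       # dynamic (ECO) features
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)  # stand-in for ViT
        self.sdn = SemanticDetectionNetwork(feat_dim, num_tags)
        self.sem_proj = nn.Linear(num_tags, d_model)
        self.embed = nn.Embedding(vocab, d_model)
        self.lstm = nn.LSTM(2 * d_model, d_model, batch_first=True)   # semantic-LSTM stand-in
        self.out = nn.Linear(d_model, vocab)

    def forward(self, feats2d, feats3d, captions):
        # feats2d/feats3d: (B, T, feat_dim) precomputed features; captions: (B, L) token ids
        tokens = torch.cat([self.proj2d(feats2d), self.proj3d(feats3d)], dim=1)
        enc = self.encoder(tokens)                       # global encoding (B, 2T, d_model)
        sem = self.sem_proj(self.sdn(feats2d.mean(dim=1)))  # semantic features (B, d_model)
        ctx = enc.mean(dim=1)                            # pooled visual context (B, d_model)
        words = self.embed(captions)                     # (B, L, d_model)
        guide = (ctx + sem).unsqueeze(1).expand(-1, words.size(1), -1)
        hidden, _ = self.lstm(torch.cat([words, guide], dim=-1))
        return self.out(hidden)                          # (B, L, vocab) logits


# Shape check with random tensors.
model = CaptionModel()
logits = model(torch.randn(2, 16, 2048), torch.randn(2, 16, 2048),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)   # torch.Size([2, 12, 10000])
```

In this sketch the semantic probabilities act purely as a global guidance vector added to the visual context at every decoding step; the paper's attention-based fusion and semantic LSTM gating would replace that additive step.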

Key words: video content caption, video understanding, Vision Transformer(ViT) model, semantic guidance, Long Short-Term Memory(LSTM) network, attention mechanism

