| 1 | 付燕, 马钰, 叶鸥. 融合深度学习和视觉文本的视频描述方法. 科学技术与工程, 2021, 21(14): 5855-5861. |
|  | FU Y, MA Y, YE O. Video captioning method combining deep networks and visual text. Science Technology and Engineering, 2021, 21(14): 5855-5861. |
| 2 | 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述. 自动化学报, 2022, 48(2): 375-397. |
|  | TANG P J, WANG H L. From video to language: survey of video captioning and description. Acta Automatica Sinica, 2022, 48(2): 375-397. |
| 3 | ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 8319-8328. |
| 4 | PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 1-10. |
| 5 | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 213-229. |
| 6 | SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2017: 1-10. |
| 7 | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2017: 6299-6308. |
| 8 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031 |
| 9 | XU Y J, HAN Y H, HONG R C, et al. Sequential video VLAD: training the aggregation locally and temporally. IEEE Transactions on Image Processing, 2018, 27(10): 4933-4944. doi: 10.1109/TIP.2018.2846664 |
| 10 | WANG H Y, XU Y J, HAN Y H. Spotting and aggregating salient regions for video captioning[C]//Proceedings of the 26th International Conference on Multimedia. New York, USA: ACM Press, 2018: 1519-1526. |
| 11 | ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 13096-13105. |
| 12 | WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]//Proceedings of International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2019: 1-10. |
| 13 | KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171-184. doi: 10.1023/A:1020346032608 |
| 14 |  |
| 15 | DAS P, XU C L, DOELL R F, et al. A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2013: 2634-2641. |
| 16 |  |
| 17 | YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2015: 4507-4515. |
| 18 | CHEN Y Y, WANG S H, ZHANG W G, et al. Less is more: picking informative frames for video captioning[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 367-384. |
| 19 | WANG J B, WANG W, HUANG Y, et al. M3: multimodal memory modelling for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2018: 7512-7520. |
| 20 | PEI W J, ZHANG J Y, WANG X R, et al. Memory-attended recurrent network for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 8347-8356. |
| 21 |  |
| 22 | BAI Y, WANG J Y, LONG Y, et al. Discriminative latent semantic graph for video captioning[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 3556-3564. |
| 23 | RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2021: 2514-2522. |
| 24 | ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2021: 9837-9846. |
| 25 | CHEN J W, PAN Y W, LI Y H, et al. Retrieval augmented convolutional encoder-decoder networks for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(1s): 1-24. |
| 26 | LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end Transformers with sparse attention for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2022: 17949-17958. |
| 27 | AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 12487-12496. |
| 28 | ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 13278-13288. |
| 29 | DUTA I, NICOLICIOIU A L, LEORDEANU M. Discovering dynamic salient regions for spatio-temporal graph neural networks[C]//Proceedings of the 35th Conference on Neural Information Processing Systems. New York, USA: [s. n.], 2021: 1-10. |
| 30 |  |
| 31 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 770-778. |
| 32 | 侯静怡, 齐雅昀, 吴心筱, 等. 跨语言知识蒸馏的视频中文字幕生成. 计算机学报, 2021, 44(9): 1907-1921. |
|  | HOU J Y, QI Y Y, WU X X, et al. Cross-lingual knowledge distillation for Chinese video captioning. Chinese Journal of Computers, 2021, 44(9): 1907-1921. |