[1] 付燕, 马钰, 叶鸥. 融合深度学习和视觉文本的视频描述方法. 科学技术与工程, 2021, 21(14): 5855-5861.
FU Y, MA Y, YE O. Video captioning method combining deep networks and visual text. Science Technology and Engineering, 2021, 21(14): 5855-5861.

[2] 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述. 自动化学报, 2022, 48(2): 375-397.
TANG P J, WANG H L. From video to language: survey of video captioning and description. Acta Automatica Sinica, 2022, 48(2): 375-397.
[3] ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 8319-8328.

[4] PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 1-10.

[5] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 213-229.

[6] SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2017: 1-10.

[7] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 6299-6308.
[8] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031

[9] XU Y J, HAN Y H, HONG R C, et al. Sequential video VLAD: training the aggregation locally and temporally. IEEE Transactions on Image Processing, 2018, 27(10): 4933-4944. doi: 10.1109/TIP.2018.2846664

[10] WANG H Y, XU Y J, HAN Y H. Spotting and aggregating salient regions for video captioning[C]//Proceedings of the 26th ACM International Conference on Multimedia. New York, USA: ACM Press, 2018: 1519-1526.

[11] ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 13096-13105.

[12] WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]//Proceedings of International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2019: 1-10.
[13] KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171-184. doi: 10.1023/A:1020346032608

[14]

[15] DAS P, XU C L, DOELL R F, et al. A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2013: 2634-2641.

[16]

[17] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2015: 4507-4515.
[18] CHEN Y Y, WANG S H, ZHANG W G, et al. Less is more: picking informative frames for video captioning[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 367-384.

[19] WANG J B, WANG W, HUANG Y, et al. M3: multimodal memory modelling for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7512-7520.

[20] PEI W J, ZHANG J Y, WANG X R, et al. Memory-attended recurrent network for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 8347-8356.

[21]

[22] BAI Y, WANG J Y, LONG Y, et al. Discriminative latent semantic graph for video captioning[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 3556-3564.

[23] RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2021: 2514-2522.
[24] ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2021: 9837-9846.

[25] CHEN J W, PAN Y W, LI Y H, et al. Retrieval augmented convolutional encoder-decoder networks for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(1s): 1-24.

[26] LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end Transformers with sparse attention for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2022: 17949-17958.

[27] AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 12487-12496.

[28] ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 13278-13288.

[29] DUTA I, NICOLICIOIU A L, LEORDEANU M. Discovering dynamic salient regions for spatio-temporal graph neural networks[C]//Proceedings of the 35th Conference on Neural Information Processing Systems. New York, USA: [s. n.], 2021: 1-10.
[30]

[31] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2016: 770-778.

[32] 侯静怡, 齐雅昀, 吴心筱, 等. 跨语言知识蒸馏的视频中文字幕生成. 计算机学报, 2021, 44(9): 1907-1921.
HOU J Y, QI Y Y, WU X X, et al. Cross-lingual knowledge distillation for Chinese video captioning. Chinese Journal of Computers, 2021, 44(9): 1907-1921.