
Computer Engineering, 2022, Vol. 48, Issue (10): 45-54. doi: 10.19678/j.issn.1000-3428.0063294

• Hot Topics and Reviews •


Research on Text Representation of Video Content Based on Multi-Modal Fusion and Multi-Layer Attention

ZHAO Hong, GUO Lan, CHEN Zhiwen, ZHENG Houze   

  1. College of Computer and Communication, Lanzhou University of Technology, Lanzhou 730050, China
  • Received: 2021-11-19  Revised: 2021-12-27  Published: 2021-12-30
  • About the authors: ZHAO Hong (born 1971), male, professor, Ph.D.; his main research interests are computer vision, natural language processing, and deep learning. GUO Lan (corresponding author), CHEN Zhiwen, and ZHENG Houze are master's students.
  • Funding: National Natural Science Foundation of China (62166025, 51668043); Key Research and Development Program of Gansu Province (21YF5GA073).


Abstract: To address the problems that existing video content text-representation models generate monotonous descriptions with limited accuracy, a model that fuses frame-level image and audio information is proposed. A single-modal embedding layer is designed on the basis of a self-attention mechanism and applied to each single-modal feature to learn its parameters. Two schemes, joint representation and cooperative representation, are adopted to fuse the high-dimensional feature vectors output by the single-modal embedding layers, enabling the model to attend to the different objects in a video and the interactions between them, and thereby generate richer and more accurate text descriptions. The model is pretrained on large-scale datasets; representation information extracted from a video, such as its frames and the audio it carries, is then fed into an encoder-decoder to produce the text representation of the video content. Experimental results on the MSR-VTT and LSMDC datasets show that the proposed model achieves BLEU-4, METEOR, ROUGE-L, and CIDEr scores of 0.386, 0.250, 0.609, and 0.463, respectively, improvements of 0.082, 0.037, 0.115, and 0.257 over the model released by IIT Delhi in the MSR-VTT challenge. The model thus effectively improves the accuracy of video content text representation.

Key words: text description of video content, multi-modal fusion, joint representation, collaborative representation, self-attention mechanism
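
To make the fusion step described in the abstract concrete, below is a minimal PyTorch sketch of its three ingredients: a self-attention single-modal embedding layer, joint representation (concatenating both modalities into one shared space), and cooperative representation (separate projections kept aligned by a similarity constraint). This is an illustration only, not the authors' released implementation; all module names, dimensions, and the choice of alignment loss are assumptions.

# Sketch of the single-modal self-attention embedding and the two bimodal
# fusion schemes (joint vs. cooperative). Illustrative assumptions throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalEmbedding(nn.Module):
    """Single-modal embedding layer: self-attention over a feature sequence."""

    def __init__(self, feat_dim: int, model_dim: int = 512, heads: int = 8):
        super().__init__()
        self.proj = nn.Linear(feat_dim, model_dim)
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(model_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, feat_dim) -> (batch, seq_len, model_dim)
        h = self.proj(x)
        out, _ = self.attn(h, h, h)    # self-attention: Q = K = V = h
        return self.norm(h + out)      # residual connection + layer norm


class JointFusion(nn.Module):
    """Joint representation: concatenate modalities, map to one shared space."""

    def __init__(self, model_dim: int = 512):
        super().__init__()
        self.fuse = nn.Linear(2 * model_dim, model_dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        # v, a: (batch, seq_len, model_dim); sequences assumed time-aligned
        return torch.tanh(self.fuse(torch.cat([v, a], dim=-1)))


class CooperativeFusion(nn.Module):
    """Cooperative (coordinated) representation: per-modality projections
    whose pooled embeddings are pulled together by a cosine alignment loss."""

    def __init__(self, model_dim: int = 512):
        super().__init__()
        self.v_proj = nn.Linear(model_dim, model_dim)
        self.a_proj = nn.Linear(model_dim, model_dim)

    def forward(self, v: torch.Tensor, a: torch.Tensor):
        zv, za = self.v_proj(v), self.a_proj(a)
        align_loss = 1.0 - F.cosine_similarity(
            zv.mean(dim=1), za.mean(dim=1), dim=-1).mean()
        return zv + za, align_loss


if __name__ == "__main__":
    frames = torch.randn(2, 20, 2048)  # e.g. 20 frame-level CNN features
    audio = torch.randn(2, 20, 128)    # e.g. 20 audio feature vectors
    v = ModalEmbedding(2048)(frames)
    a = ModalEmbedding(128)(audio)
    fused_joint = JointFusion()(v, a)                 # (2, 20, 512)
    fused_coop, loss = CooperativeFusion()(v, a)      # (2, 20, 512), scalar
    print(fused_joint.shape, fused_coop.shape, loss.item())

Joint representation yields one fused sequence that can be fed directly to a downstream encoder-decoder, while the cooperative scheme keeps per-modality spaces and supervises their agreement; the abstract reports adopting both schemes.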
