[1] 宁培阳.基于深度学习的视频描述方法研究[D].广州:华南理工大学,2019. NING P Y.Video captioning based on deep learning[D].Guangzhou:South China University of Technology,2019.(in Chinese) [2] 王金金,曾上游,李文惠,等.基于扩张卷积的注意力机制视频描述模型[J].电子测量技术,2021,44(23):99-104. WANG J J,ZENG S Y,LI W H,et al.Video description model of attention mechanism based on dilated convolution[J].Electronic Measurement Technology,2021,44(23):99-104.(in Chinese) [3] 潘晓容.基于视频内容的动态摘要生成算法研究[D].西安:西安理工大学,2021. PAN X R.Research on dynamic summarization generation algorithm based on video content[D].Xi'an:Xi'an University of Technology,2021.(in Chinese) [4] 汤鹏杰,王瀚漓.从视频到语言:视频标题生成与描述研究综述[J].自动化学报,2022,48(2):375-397. TANG P J,WANG H L.From video to language:survey of video captioning and description[J].Acta Automatica Sinica,2022,48(2):375-397.(in Chinese) [5] SZEGEDY C,LIU W,JIA Y Q,et al.Going deeper with convolutions[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:1-9. [6] LI Y C,ZHOU R G,XU R Q,et al.A quantum deep convolutional neural network for image recognition[J].Quantum Science and Technology,2020,5(4):044003. [7] PARK J,WOO S,LEE J Y,et al.A simple and light-weight attention module for convolutional neural networks[J].International Journal of Computer Vision,2020,128(4):783-798. [8] YOUSUF H,LAHZI M,SALLOUM S A,et al.A systematic review on sequence-to-sequence learning with neural network and its models[J].International Journal of Electrical and Computer Engineering,2021,11(3):2315. [9] OTTER D W,MEDINA J R,KALITA J K.A survey of the usages of deep learning for natural language processing[J].IEEE Transactions on Neural Networks and Learning Systems,2021,32(2):604-624. [10] XIAO J Q,ZHOU Z Y.Research progress of RNN language model[C]//Proceedings of IEEE International Conference on Artificial Intelligence and Computer Applications.Washington D.C.,USA:IEEE Press,2020:1285-1288. [11] 赵宏,郭岚,陈志文,等.基于多模态融合与多层注意力的视频内容文本表述研究[J].计算机工程,2022,48(10):45-54. ZHAO H,GUO L,CHEN Z W,et al.Research on text representation of video content based on multi-modal fusion and multi-layer attention[J].Computer Engineering,2022,48(10):45-54.(in Chinese) [12] VENUGOPALAN S,ROHRBACH M,DONAHUE J,et al.Sequence to sequence-video to text[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2016:4534-4542. [13] TANG P J,WANG H L,LI Q Y.Rich visual and language representation with complementary semantics for video captioning[J].ACM Transactions on Multimedia Computing,Communications,and Applications,2019,15(2):1-23. [14] ZHANG J C,PENG Y X.Video captioning with object-aware spatio-temporal correlation and aggregation[J].IEEE Transactions on Image Processing,2020,29:6209-6222. [15] 丁恩杰,刘忠育,刘亚峰,等.基于多维度和多模态信息的视频描述方法[J].通信学报,2020,41(2):36-43. DING E J,LIU Z Y,LIU Y F,et al.Video description method based on multidimensional and multimodal information[J].Journal on Communications,2020,41(2):36-43.(in Chinese) [16] CHEN H,LIN K,MAYE A,et al.A semantics-assisted video captioning model trained with scheduled sampling[J].Frontiers in Robotics and AI,2020,7:475767. [17] RAHMAN M M,ABEDIN T,PROTTOY K S S,et al.Video captioning with stacked attention and semantic hard pull[J].PeerJ Computer Science,2021,7:e664. [18] XIE S N,GIRSHICK R,DOLLÁR P,et al.Aggregated residual transformations for deep neural networks[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2017:5987-5995. [19] ZOLFAGHARI M,SINGH K,BROX T.ECO:efficient convolutional network for online video understanding[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2018:695-712. [20] FAN H Q,XIONG B,MANGALAM K,et al.Multiscale Vision Transformers[C]//Proceedings of IEEE/CVF International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2022:6804-6815. [21] EL-NOUBY A,NEVEROVA N,LAPTEV I,et al.Training Vision Transformers for image retrieval[EB/OL].[2022-03-02].https://arxiv.org/abs/2102.05644. [22] DOSOVITSKIY A,BEYER L,KOLESNIKOV A,et al.An image is worth 16×16 words:Transformers for image recognition at scale[EB/OL].[2022-03-02].https://arxiv.org/abs/2010.11929. [23] HAN K,WANG Y,CHEN H,et al.A survey on Vision Transformer[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2022,45(1):87-110. [24] IASHIN V,RAHTU E.A better use of audio-visual cues:dense video captioning with Bi-modal Transformer[EB/OL].[2022-03-02].https://arxiv.org/abs/2005.08271. [25] ZHOU L W,ZHOU Y B,CORSO J J,et al.End-to-end dense video captioning with masked Transformer[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2018:8739-8748. [26] ZHAO H,CHEN Z W,GUO L,et al.Video captioning based on Vision Transformer and reinforcement learning[J].PeerJ Computer Science,2022,8:e916. [27] ZHU L C,XU Z W,YANG Y.Bidirectional multirate reconstruction for temporal modeling in videos[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2017:1339-1348. [28] ZHANG J C,PENG Y X.Object-aware aggregation with bidirectional temporal graph for video captioning[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2020:8319-8328. [29] LIU S,REN Z,YUAN J S.SibNet:sibling convolutional encoder for video captioning[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2021,43(9):3259-3272. [30] WANG T,ZHENG H C,YU M J,et al.Event-centric hierarchical representation for dense video captioning[J].IEEE Transactions on Circuits and Systems for Video Technology,2021,31(5):1890-1900. [31] PAPINENI K,ROUKOS S,WARD T,et al.BLEU:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.New York,USA:ACM Press,2002:311-318. [32] LIN C Y.ROUGE:a package for automatic evaluation of summaries[C]//Proceedings of the Workshop on Text Summarization Branches Out.Washington D.C.,USA:IEEE Press,2004:74-81. [33] VEDANTAM R,LAWRENCE ZITNICK C,PARIKH D.CIDEr:consensus-based image description evaluation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:4566-4575. [34] DENKOWSKI M,LAVIE A.Meteor universal:language specific translation evaluation for any target language[C]//Proceedings of the 9th Workshop on Statistical Machine Translation.Stroudsburg,USA:Association for Computational Linguistics,2014:376-380. [35] XU J,MEI T,YAO T,et al.MSR-VTT:a large video description dataset for bridging video and language[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:5288-5296. [36] ZHENG Y,ZHANG Y J,FENG R,et al.Stacked multimodal attention network for context-aware video captioning[J].IEEE Transactions on Circuits and Systems for Video Technology,2022,32(1):31-42. [37] JI W T,WANG R L,TIAN Y,et al.An attention based dual learning approach for video captioning[J].Applied Soft Computing,2022,117:108332. [38] JIN T,HUANG S Y,CHEN M,et al.SBAT:video captioning with sparse boundary-aware transformer[EB/OL].[2022-03-02].https://arxiv.org/abs/2007.11888. [39] LIU S,REN Z,YUAN J S.SibNet:sibling convolutional encoder for video captioning[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2021,43(9):3259-3272. [40] RYU H,KANG S,KANG H,et al.Semantic grouping network for video captioning[J].Proceedings of the AAAI Conference on Artificial Intelligence,2021,35(3):2514-2522. |