[1] JI J Z, XU C, ZHANG X D, et al.Spatio-temporal memory attention for image captioning[J].IEEE Transactions on Image Processing, 2020, 29:7615-7628. [2] YANG J C, WANG C G, JIANG B, et al.Visual perception enabled industry intelligence:state of the art, challenges and prospects[J].IEEE Transactions on Industrial Informatics, 2021, 17(3):2204-2219. [3] GUO W Y, ZHANG Y, YANG J F, et al.re-attention for visual question answering[J].IEEE Transactions on Image Processing, 2021, 30:6730-6743. [4] LIU F L, WU X, GE S, et al.DiMBERT:learning vision-language grounded representations with disentangled multimodal-attention[J].ACM Transactions on Knowledge Discovery from Data, 2022, 16(1):1-19. [5] ZHANG L, HE Z W, YANG Y, et al.Tasks integrated networks:joint detection and retrieval for image search[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2022, 44(1):456-473. [6] QIAO S S, WANG R P, SHAN S G, et al.Deep video code for efficient face video retrieval[J].Pattern Recognition, 2021, 113:107754-107762. [7] WU S M, WIELAND J, FARIVAR O, et al.Automatic alt-text:computer-generated image descriptions for blind users on a social network service[C]//Proceedings of 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing.New York, USA:ACM Press, 2017:1180-1192. [8] GUADARRAMA S, KRISHNAMOORTHY N, MALKARNENKAR G, et al.YouTube2Text:recognizing and describing arbitrary activities using semantic hierarchies and zero-shot recognition[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2017:2712-2719. [9] PEREZ-MARTIN J, BUSTOS B, REZ J.Improving video captioning with temporal composition of a visual-syntactic embedding[C]//Proceedings of IEEE Winter Conference on Applications of Computer Vision.Washington D.C., USA:IEEE Press, 2021:3038-3048. [10] ZHU M J, DUAN C R, YU C B.Video captioning in compressed video[EB/OL].[2021-10-09].https://arxiv.org/abs/2101.00359. [11] SZEGEDY C, LIU W, JIA Y Q, et al.Going deeper with convolutions[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:1-9. [12] LI Y C, ZHOU R G, XU R Q, et al.A quantum deep convolutional neural network for image recognition[J].Quantum Science and Technology, 2020, 5(4):44003-44012. [13] PARK J, WOO S, LEE J Y, et al.A simple and light-weight attention module for convolutional neural networks[J].International Journal of Computer Vision, 2020, 128(4):783-798. [14] YOUSUF H, LAHZI M, SALLOUM S A, et al.A systematic review on sequence-to-sequence learning with neural network and its models[J].International Journal of Electrical and Computer Engineering, 2021, 11(3):2315-2321. [15] OTTER D W, MEDINA J R, KALITA J K.A survey of the usages of deep learning for natural language processing[J].IEEE Transactions on Neural Networks and Learning Systems, 2021, 32(2):604-624. [16] XIAO J Q, ZHOU Z Y.Research progress of RNN language model[C]//Proceedings of IEEE International Conference on Artificial Intelligence and Computer Applications.Washington D.C., USA:IEEE Press, 2020:1285-1288. [17] 何俊, 张彩庆, 李小珍, 等.面向深度学习的多模态融合技术研究综述[J].计算机工程, 2020, 46(5):1-11. HE J, ZHANG C Q, LI X Z, et al.Survey of research on multimodal fusion technology for deep learning[J].Computer Engineering, 2020, 46(5):1-11.(in Chinese) [18] VENUGOPALAN S, XU H J, DONAHUE J, et al.Translating videos to natural language using deep recurrent neural networks[EB/OL].[2021-10-09].https://arxiv.org/abs/1412.4729. [19] DONG J F, LI X R, LAN W Y, et al.Early embedding and late reranking for video captioning[C]//Proceedings of the 24th ACM International Conference on Multimedia.Washington D.C., USA:IEEE Press, 2016:1082-1086. [20] YAO L, TORABI A, CHO K, et al.Describing videos by exploiting temporal structure[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2015:4507-4515. [21] VENUGOPALAN S, ROHRBACH M, DONAHUE J, et al.Sequence to sequence-video to text[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2015:4534-4542. [22] WANG J B, WANG W, HUANG Y, et al.M3:multimodal memory modelling for video captioning[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2018:7512-7520. [23] GAO L L, LI X P, SONG J K, et al.Hierarchical LSTMs with adaptive attention for visual captioning[J].IEEE Transactions on Pattern Analysis and Machine Intelligence, 2020, 42(5):1112-1131. [24] WANG N W, LIU H Z, XU C.Deep learning for the detection of COVID-19 using transfer learning and model integration[C]//Proceedings of the 10th International Conference on Electronics Information and Emergency Communication.Washington D.C., USA:IEEE Press, 2020:281-284. [25] GUO L T, LIU J, ZHU X X, et al.Normalized and geometry-aware self-attention network for image captioning[C]//Proceedings of IEEE/CVF Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2020:10324-10333. [26] HOCHREITER S, SCHMIDHUBER J.Long short-term memory[J].Neural Computation, 1997, 9(8):1735-1780. [27] 单礼岩, 李新伟.基于时空信息特征融合的视频指纹算法[J].计算机工程, 2019, 45(8):260-265, 274. SHAN L Y, LI X W.Video fingerprinting algorithm based on temporal and spatial information feature fusion[J].Computer Engineering, 2019, 45(8):260-265, 274.(in Chinese) [28] HU J, SHEN L, ALBANIE S, et al.Squeeze-and-excitation networks[C]//Proceedings of IEEE Transactions on Pattern Analysis and Machine Intelligence.Washington D.C., USA:IEEE Press, 2018:2011-2023. [29] LIU Q, WANG C.Within-component and between-component multi-kernel discriminating correlation analysis for colour face recognition[J].IET Computer Vision, 2017, 11(8):663-674. [30] ALBADR M A A, TIUN S, AYOB M, et al.Mel-frequency cepstral coefficient features based on standard deviation and principal component analysis for language identification systems[J].Cognitive Computation, 2021, 13(5):1136-1153. [31] YANG N N, DEY N, SHERRATT R S, et al.Recognize basic emotional statesin speech by machine learning techniques using mel-frequency cepstral coefficient features[J].Journal of Intelligent & Fuzzy Systems, 2020, 39(2):1925-1936. [32] 项要杰, 杨俊安, 李晋徽, 等.一种适用于说话人识别的改进Mel滤波器[J].计算机工程, 2013, 39(11):214-217, 222. XIANG Y J, YANG J A, LI J H, et al.An improved mel-frequency filter for speaker recognition[J].Computer Engineering, 2013, 39(11):214-217, 222.(in Chinese) [33] MORENCY L P, BALTRUŠAITIS T.Multimodal machine learning:integrating language, vision and speech[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics.Stroudsburg, USA:Association for Computational Linguistics, 2017:3-5 [34] XU J, MEI T, YAO T, et al.MSR-VTT:a large video description dataset for bridging video and language[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2017:5288-5296. [35] ROHRBACH A, ROHRBACH M, TANDON N, et al.A dataset for movie description[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:3202-3212. [36] HUANG Y F, SHIH L P, TSAI C H, et al.Describing video scenarios using deep learning techniques[J].International Journal of Intelligent Systems, 2021, 36(6):2465-2490. [37] VEDANTAM R, ZITNICK C L, PARIKH D.CIDEr:consensus-based image description evaluation[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C., USA:IEEE Press, 2015:4566-4575. [38] DENKOWSKI M, LAVIE A.Meteor universal:language specific translation evaluation for any target language[C]//Proceedings of the 9th Workshop on Statistical Machine Translation.Stroudsburg, USA:Association for Computational Linguistics, 2014:376-380. [39] LIN C Y.Rouge:a package for automatic evaluation of summaries[EB/OL].[2021-10-09].https://www.researchgate.net/publication/224890821_ROUGE_A_Package_for_Automatic_Evaluation_of_summaries. [40] PAPINENI K, ROUKOS S, WARD T, et al.BLEU:a method for automatic evaluation of machine translation[C]//Proceedings of the 40th Annual Meeting on Association for Computational Linguistics.Stroudsburg, USA:Association for Computational Linguistics, 2002:311-318. [41] KINGMA D P, BA J.Adam:a method for stochastic optimization[EB/OL].[2021-10-09].https://ui.adsabs.harvard.edu/abs/2014arXiv1412.6980K. [42] VENUGOPALAN S, XU H J, DONAHUE J, et al.Translating videos to natural language using deep recurrent neural networks[EB/OL].[2021-10-09].https://arxiv.org/abs/1412.4729. [43] WANG X, WU J W, CHEN J K, et al.VaTeX:a large-scale, high-quality multilingual dataset for video-and-language research[C]//Proceedings of IEEE/CVF International Conference on Computer Vision.Washington D.C., USA:IEEE Press, 2019:4580-4590. |