[1]李伟健, 胡慧君. 基于潜在特征增强网络的视频描述生成方法[J]. 计算机工程, 2024, 50(2): 266-272.
Li W J, Hu H J. Video description generation method based on latent feature augmented network[J]. Computer Engineering, 2024, 50(2): 266-272. (in Chinese)
[2]张浩萌, 刘斌. 融合语义信息和视觉推理特征的视频描述方法[J]. 小型微型计算机系统, 2024, 45(2): 470-476.
Zhang H M, Liu B. Video captioning method fusing semantic information and visual reasoning features[J]. Journal of Chinese Computer Systems, 2024, 45(2): 470-476. (in Chinese)
[3]Liu Y, Zhu H, Wu Z, et al. Adaptive semantic guidance network for video captioning[J]. Computer Vision and Image Understanding, 2025, 251: 104255-104265.
[4]Zeng P, Zhang H, Gao L, et al. Visual commonsense-aware representation network for video captioning[J]. IEEE Transactions on Neural Networks and Learning Systems, 2025, 36(1): 1092-1103.
[5]Shen W, Song J, Zhu X, et al. End-to-end pre-training with hierarchical matching and momentum contrast for text-video retrieval[J]. IEEE Transactions on Image Processing, 2023, 32: 5017-5030.
[6]Krishnamoorthy N, Malkarnenkar G, Mooney R, et al. Generating natural-language video descriptions using text-mined knowledge[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto, USA: AAAI Press, 2013, 27(1): 541-547.
[7]Jing S, Zhang H, Zeng P, et al. Memory-based augmentation network for video captioning[J]. IEEE Transactions on Multimedia, 2023, 26: 2367-2379.
[8]Tu Y, Zhou C, Guo J, et al. Relation-aware attention for video captioning via graph learning[J]. Pattern Recognition, 2023, 136: 109204.
[9]Wu A, Han Y, Yang Y, et al. Convolutional reconstruction-to-sequence for video captioning[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(11): 4299-4308.
[10]Yan L, Ma S, Wang Q, et al. Video captioning using global-local representation[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(10): 6642-6656.
[11]Pan B, Cai H, Huang D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Washington D. C. , USA: IEEE Press, 2020: 10870-10879.
[12]Zhang Z, Shi Y, Yuan C, et al. Object relational graph with teacher-recommended learning for video captioning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Washington D. C. , USA: IEEE Press, 2020: 13278-13288.
[13]Gao L, Lei Y, Zeng P, et al. Hierarchical representation network with auxiliary tasks for video captioning and video question answering[J]. IEEE Transactions on Image Processing, 2021, 31: 202-215.
[14]Tu Y, Zhou C, Guo J, et al. Enhancing the alignment between target words and corresponding frames for video captioning[J]. Pattern Recognition, 2021, 111: 107702.
[15]Wu B, Niu G, Yu J, et al. Towards knowledge-aware video captioning via transitive visual relationship detection[J]. IEEE Transactions on Circuits and Systems for Video Technology, 2022, 32(10): 6753-6765.
[16]Zhong X, Li Z, Chen S, et al. Refined semantic enhancement towards frequency diffusion for video captioning[C]//Proceedings of the AAAI conference on artificial intelligence. Palo Alto, USA: AAAI Press, 2023, 37(3): 3724-3732.
[17]Zhang H, Gao L, Zeng P, et al. Depth-aware sparse transformer for video-language learning[C]//Proceedings of the 31st ACM International Conference on Multimedia. New York, USA: ACM Press, 2023: 4778-4787.
[18]Gu X, Chen G, Wang Y, et al. Text with knowledge graph augmented transformer for video captioning[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. Washington D. C. , USA: IEEE Press, 2023: 18941-18951.
[19]Li L, Gao X, Deng J, et al. Long short-term relation transformer with global gating for video captioning[J]. IEEE Transactions on Image Processing, 2022, 31: 2726-2738.
[20]Zhao H, Chen Z, Yang Y. Multi-scale features with temporal information guidance for video captioning[J]. Engineering Applications of Artificial Intelligence, 2024, 137: 109102.
[21]Yang B, Cao M, Zou Y. Concept-aware video captioning: Describing videos with effective prior information[J]. IEEE Transactions on Image Processing, 2023, 32: 5366-5378.
[22]Yu Y, Ko H, Choi J, et al. End-to-end concept word detection for video captioning, retrieval, and question answering[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. Washington D. C. , USA: IEEE Press, 2017: 3165-3173.
[23]Xu Y, Yang J, Mao K. Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature[J]. Neurocomputing, 2019, 357: 24-35.
[24]Sun L, Li B, Yuan C, et al. Multimodal semantic attention network for video captioning[C]//Proceedings of the 2019 IEEE International Conference on Multimedia and Expo (ICME). Washington D. C. , USA: IEEE Press, 2019: 1300-1305.
[25]Gabeur V, Sun C, Alahari K, et al. Multi-modal transformer for video retrieval[C]//European Conference on Computer Vision. Berlin, Germany: Springer Press, 2020: 214-229.
[26]Wang X, Zhu L, Yang Y. T2VLAD: global-local sequence alignment for text-video retrieval[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D. C. , USA: IEEE Press, 2021: 5075-5084.
[27]Lei J, Li L, Zhou L, et al. Less is more: ClipBERT for video-and-language learning via sparse sampling[C]//2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). Washington D. C. , USA: IEEE Press, 2021: 7327-7337.
[28]Zhao S, Zhu L, Wang X, et al. CenterCLIP: Token clustering for efficient text-video retrieval[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York, USA: ACM Press, 2022: 970-981.
[29]Wang X, Zhu L, Zheng Z, et al. Align and tell: Boosting text-video retrieval with local alignment and fine-grained supervision[J]. IEEE Transactions on Multimedia, 2022, 25: 6079-6089.
[30]Luo X, Luo X, Wang D, et al. Global semantic enhancement network for video captioning[J]. Pattern Recognition, 2024, 145: 109906.
[31]Ryu H, Kang S, Kang H, et al. Semantic grouping network for video captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto, USA: AAAI Press, 2021, 35(3): 2514-2522.
[32]Chen S, Jiang Y G. Motion guided region message passing for video captioning[C]//Proceedings of the IEEE/CVF international conference on computer vision. Washington D. C. , USA: IEEE Press, 2021: 1543-1552.