| 1 | 付燕, 马钰, 叶鸥. 融合深度学习和视觉文本的视频描述方法. 科学技术与工程, 2021, 21(14): 5855-5861. |
|  | FU Y, MA Y, YE O. Video captioning method combining deep networks and visual text. Science Technology and Engineering, 2021, 21(14): 5855-5861. |
| 2 | 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述. 自动化学报, 2022, 48(2): 375-397. |
|  | TANG P J, WANG H L. From video to language: survey of video captioning and description. Acta Automatica Sinica, 2022, 48(2): 375-397. |
| 3 | ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 8319-8328. |
| 4 | PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 1-10. |
| 5 | CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 213-229. |
| 6 | SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2017: 1-10. |
| 7 | CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2017: 6299-6308. |
| 8 | REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031 |
| 9 | XU Y J, HAN Y H, HONG R C, et al. Sequential video VLAD: training the aggregation locally and temporally. IEEE Transactions on Image Processing, 2018, 27(10): 4933-4944. doi: 10.1109/TIP.2018.2846664 |
| 10 | WANG H Y, XU Y J, HAN Y H. Spotting and aggregating salient regions for video captioning[C]//Proceedings of the 26th International Conference on Multimedia. New York, USA: ACM Press, 2018: 1519-1526. |
| 11 | ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 13096-13105. |
| 12 | WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]//Proceedings of International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2019: 1-10. |
| 13 | KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171-184. doi: 10.1023/A:1020346032608 |
| 14 |  |
| 15 | DAS P, XU C L, DOELL R F, et al. A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2013: 2634-2641. |
| 16 |  |
| 17 | YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2015: 4507-4515. |
| 18 | CHEN Y Y, WANG S H, ZHANG W G, et al. Less is more: picking informative frames for video captioning[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 367-384. |
| 19 | WANG J B, WANG W, HUANG Y, et al. M3: multimodal memory modelling for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2018: 7512-7520. |
| 20 | PEI W J, ZHANG J Y, WANG X R, et al. Memory-attended recurrent network for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 8347-8356. |
| 21 |  |
| 22 | BAI Y, WANG J Y, LONG Y, et al. Discriminative latent semantic graph for video captioning[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 3556-3564. |
| 23 | RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2021: 2514-2522. |
| 24 | ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2021: 9837-9846. |
| 25 | CHEN J W, PAN Y W, LI Y H, et al. Retrieval augmented convolutional encoder-decoder networks for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(1s): 1-24. |
| 26 | LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end Transformers with sparse attention for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2022: 17949-17958. |
| 27 | AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2019: 12487-12496. |
| 28 | ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2020: 13278-13288. |
| 29 | DUTA I, NICOLICIOIU A L, LEORDEANU M. Discovering dynamic salient regions for spatio-temporal graph neural networks[C]//Proceedings of the 35th Conference on Neural Information Processing Systems. New York, USA: [s. n.], 2021: 1-10. |
| 30 |  |
| 31 | HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 770-778. |
| 32 | 侯静怡, 齐雅昀, 吴心筱, 等. 跨语言知识蒸馏的视频中文字幕生成. 计算机学报, 2021, 44(9): 1907-1921. |
|  | HOU J Y, QI Y Y, WU X X, et al. Cross-lingual knowledge distillation for Chinese video captioning. Chinese Journal of Computers, 2021, 44(9): 1907-1921. |