[1] 付燕, 马钰, 叶鸥. 融合深度学习和视觉文本的视频描述方法. 科学技术与工程, 2021, 21(14): 5855-5861.
FU Y, MA Y, YE O. Video captioning method combining deep networks and visual text. Science Technology and Engineering, 2021, 21(14): 5855-5861.

[2] 汤鹏杰, 王瀚漓. 从视频到语言: 视频标题生成与描述研究综述. 自动化学报, 2022, 48(2): 375-397.
TANG P J, WANG H L. From video to language: survey of video captioning and description. Acta Automatica Sinica, 2022, 48(2): 375-397.
[3] ZHANG J C, PENG Y X. Object-aware aggregation with bidirectional temporal graph for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 8319-8328.

[4] PAN B X, CAI H Y, HUANG D A, et al. Spatio-temporal graph for video captioning with knowledge distillation[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 1-10.

[5] CARION N, MASSA F, SYNNAEVE G, et al. End-to-end object detection with Transformers[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2020: 213-229.

[6] SZEGEDY C, IOFFE S, VANHOUCKE V, et al. Inception-v4, Inception-ResNet and the impact of residual connections on learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2017: 1-10.

[7] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2017: 6299-6308.
[8] REN S Q, HE K M, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149. doi: 10.1109/TPAMI.2016.2577031

[9] XU Y J, HAN Y H, HONG R C, et al. Sequential video VLAD: training the aggregation locally and temporally. IEEE Transactions on Image Processing, 2018, 27(10): 4933-4944. doi: 10.1109/TIP.2018.2846664

[10] WANG H Y, XU Y J, HAN Y H. Spotting and aggregating salient regions for video captioning[C]//Proceedings of the 26th ACM International Conference on Multimedia. New York, USA: ACM Press, 2018: 1519-1526.

[11] ZHENG Q, WANG C Y, TAO D C. Syntax-aware action targeting for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 13096-13105.

[12] WANG B R, MA L, ZHANG W, et al. Controllable video captioning with POS sequence guidance based on gated fusion network[C]//Proceedings of International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2019: 1-10.
[13] KOJIMA A, TAMURA T, FUKUNAGA K. Natural language description of human activities from video images based on concept hierarchy of actions. International Journal of Computer Vision, 2002, 50(2): 171-184. doi: 10.1023/A:1020346032608

[14]

[15] DAS P, XU C L, DOELL R F, et al. A thousand frames in just a few words: lingual description of videos through latent topics and sparse object stitching[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2013: 2634-2641.

[16]

[17] YAO L, TORABI A, CHO K, et al. Describing videos by exploiting temporal structure[C]//Proceedings of International Conference on Computer Vision. Washington D.C., USA: IEEE Press, 2015: 4507-4515.
[18] CHEN Y Y, WANG S H, ZHANG W G, et al. Less is more: picking informative frames for video captioning[C]//Proceedings of the European Conference on Computer Vision. Berlin, Germany: Springer, 2018: 367-384.

[19] WANG J B, WANG W, HUANG Y, et al. M3: multimodal memory modelling for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2018: 7512-7520.

[20] PEI W J, ZHANG J Y, WANG X R, et al. Memory-attended recurrent network for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 8347-8356.

[21]

[22] BAI Y, WANG J Y, LONG Y, et al. Discriminative latent semantic graph for video captioning[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York, USA: ACM Press, 2021: 3556-3564.

[23] RYU H, KANG S, KANG H, et al. Semantic grouping network for video captioning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. [S. l.]: AAAI Press, 2021: 2514-2522.
[24] ZHANG Z Q, QI Z A, YUAN C F, et al. Open-book video captioning with retrieve-copy-generate network[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2021: 9837-9846.

[25] CHEN J W, PAN Y W, LI Y H, et al. Retrieval augmented convolutional encoder-decoder networks for video captioning. ACM Transactions on Multimedia Computing, Communications, and Applications, 2023, 19(1s): 1-24.

[26] LIN K, LI L J, LIN C C, et al. SwinBERT: end-to-end Transformers with sparse attention for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2022: 17949-17958.

[27] AAFAQ N, AKHTAR N, LIU W, et al. Spatio-temporal dynamics and semantic attribute enriched visual encoding for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2019: 12487-12496.

[28] ZHANG Z Q, SHI Y Y, YUAN C F, et al. Object relational graph with teacher-recommended learning for video captioning[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2020: 13278-13288.

[29] DUTA I, NICOLICIOIU A L, LEORDEANU M. Discovering dynamic salient regions for spatio-temporal graph neural networks[C]//Proceedings of the 35th Conference on Neural Information Processing Systems. New York, USA: [s. n.], 2021: 1-10.
[30]

[31] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of Conference on Computer Vision and Pattern Recognition. Washington D.C., USA: IEEE Press, 2016: 770-778.

[32] 侯静怡, 齐雅昀, 吴心筱, 等. 跨语言知识蒸馏的视频中文字幕生成. 计算机学报, 2021, 44(9): 1907-1921.
HOU J Y, QI Y Y, WU X X, et al. Cross-lingual knowledge distillation for Chinese video captioning. Chinese Journal of Computers, 2021, 44(9): 1907-1921.