| 1 | 朱光烈. "声画结合"论批判(下)──兼做对《对一个定论的异议》批评的回应. 现代传播-北京广播学院学报, 1999, 21(5): 89- 97.  URL
 | 
																													
																							|  | ZHU G L. Criticism on the theory of "combination of sound and picture"(part two): concurrently responding to the criticism of objection to a conclusion. Modern Communication-Journal of Beijing Broadcasting University, 1999, 21(5): 89- 97.  URL
 | 
																													
																							| 2 | 何国平. 改革开放以来电视新闻节目形态的演化及其声画关系. 中国广播电视学刊, 2009,(2): 30- 31.  URL
 | 
																													
																							|  | HE G P. Evolution of TV news program form and its relationship between sound and picture since reform and opening up. China Radio & TV Academic Journal, 2009,(2): 30- 31.  URL
 | 
																													
																							| 3 | 朱羽君. 屏幕上的革命——在舟山"电视声画关系"研讨会上的发言[M]//洪民生. 电视声画论集. 北京: 人民出版社, 1993: 306. | 
																													
																							|  | ZHU Y J. The revolution on the screen: speech at the Zhoushan seminar on "the relationship between TV sound and picture"[M]//HONG M S. Anthology of TV sound and picture. Beijing: People's Publishing House, 1993: 306. (in Chinese) | 
																													
																							| 4 | RAV-ACHA A, PRITCH Y, PELEG S. Making a long video short: dynamic video synopsis[C]//Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2006: 435-441. | 
																													
																							| 5 | ZHANG Q, YU S P, ZHOU D S, et al. An efficient method of key-frame extraction based on a cluster algorithm. Journal of Human Kinetics, 2013, 39, 5- 13.  doi: 10.2478/hukin-2013-0063
 | 
																													
																							| 6 | AVILA S E F, LOPES A P B, DA LUZ A, et al. VSUMM: a mechanism designed to produce static video summaries and a novel evaluation method. Pattern Recognition Letters, 2011, 32(1): 56- 68.  doi: 10.1016/j.patrec.2010.08.004
 | 
																													
																							| 7 | BELO L, CAETANO C, PATROCÍNIO Z, et al. Graph-based hierarchical video summarization using global descriptors[C]//Proceedings of the 26th International Conference on Tools with Artificial Intelligence. Washington D. C., USA: IEEE Press, 2014: 822-829. | 
																													
																							| 8 | LIN J X, ZHONG S H, FARES A. Deep hierarchical LSTM networks with attention for video summarization. Computers & Electrical Engineering, 2022, 97, 107618. | 
																													
																							| 9 | JI Z, XIONG K L, PANG Y W, et al. Video summarization with attention-based encoder-decoder networks. IEEE Transactions on Circuits and Systems for Video Technology, 2020, 30(6): 1709- 1717.  doi: 10.1109/TCSVT.2019.2904996
 | 
																													
																							| 10 | ZHU W C, LU J W, LI J H, et al. DSNet: a flexible detect-to-summarize network for video summarization. IEEE Transactions on Image Processing, 2021, 30, 948- 962.  doi: 10.1109/TIP.2020.3039886
 | 
																													
																							| 11 | FAJTL J, SOKEH H S, ARGYRIOU V, et al. Summarizing videos with attention[C]//Proceedings of Asian Conference on Computer Vision. Berlin, Germany: Springer, 2019: 39-54. | 
																													
																							| 12 | ZHANG K, CHAO W L, SHA F, et al. Video summarization with long short-term memory[C]//Proceedings of European Conference on Computer Vision. Berlin, Germany: Springer, 2016: 766-782. | 
																													
																							| 13 | MAHASSENI B, LAM M, TODOROVIC S. Unsupervised video summarization with adversarial LSTM networks[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2017: 2982-2991. | 
																													
																							| 14 | JUNG Y, CHO D, KIM D, et al. Discriminative feature learning for unsupervised video summarization. Proceedings of the AAAI Conference on Artificial Intelligence, 2019, 33(1): 8537- 8544.  doi: 10.1609/aaai.v33i01.33018537
 | 
																													
																							| 15 | YAO T, MEI T, RUI Y. Highlight detection with pairwise deep ranking for first-person video summarization[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 982-990. | 
																													
																							| 16 | HORI C, HORI T, LEE T Y, et al. Attention-based multimodal fusion for video description[C]//Proceedings of IEEE International Conference on Computer Vision. Washington D. C., USA: IEEE Press, 2017: 4203-4212. | 
																													
																							| 17 | WEI H W, NI B B, YAN Y C, et al. Video summarization via semantic attended networks. Proceedings of the AAAI Conference on Artificial Intelligence, 2018, 32(1): 216- 223. | 
																													
																							| 18 | GHAURI J A, HAKIMOV S, EWERTH R. Supervised video summarization via multiple feature sets with parallel attention[C]//Proceedings of IEEE International Conference on Multimedia and Expo. Washington D. C., USA: IEEE Press, 2021: 1-6. | 
																													
																							| 19 |  | 
																													
																							| 20 | OTANI M, NAKASHIMA Y, RAHTU E, et al. Video summarization using deep semantic features[C]//Proceedings of Asian Conference on Computer Vision. Berlin, Germany: Springer, 2017: 361-377. | 
																													
																							| 21 | LI Y B, MERIALDO B. Multi-video summarization based on AV-MMR[C]//Proceedings of International Workshop on Content Based Multimedia Indexing. Washington D. C., USA: IEEE Press, 2010: 1-6. | 
																													
																							| 22 | YAO Y F. Semantic feature hierarchical clustering algorithm based on improved regional merging strategy. Cluster Computing, 2019, 22(1): 1495- 1503. | 
																													
																							| 23 | DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[EB/OL]. [2022-08-10]. https://arxiv.org/abs/1810.04805 . | 
																													
																							| 24 |  | 
																													
																							| 25 | CHOPRA S, HADSELL R, LECUN Y. Learning a similarity metric discriminatively, with application to face verification[C]//Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2005: 539-546. | 
																													
																							| 26 | XU J, MEI T, YAO T, et al. MSR-VTT: a large video description dataset for bridging video and language[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. Washington D. C., USA: IEEE Press, 2016: 5288-5296. | 
																													
																							| 27 |  |