Research and Implementation of Key Frame Summarization Model for News Short Video

doi:10.19678/j.issn.1000-3428.0065727

Abstract

According to the "sound and picture relationship" theory in communication studies, news short videos convey their content directly and effectively through audio, which makes them typical voice-dominated videos. Existing video summarization techniques ignore the influence of the sound-picture relationship on how video content is expressed, so their performance is unstable on specific types of short video. To address the voice-dominated character of news short videos, this paper proposes a Key Frame Summarization model for News Short Video (KFS4NSV) based on the semantic similarity of multimodal features. Unlike traditional fusion models, the proposed model first extracts multimodal features, then constructs a common semantic space and jointly trains on image-text pairs by minimizing a contrastive loss function, which yields a cross-modal measure of semantic similarity between the audio text summary and the video frames. When generating a summary, the model focuses on image content that is consistent with the semantic information in the audio and uses that information to select relevant key frames, producing a more accurate short video summary. The experimental dataset consists of 450 CCTV news short videos and 385 Bilibili self-media news short videos. The F1 value is used to measure the performance of the different models. The results show that the proposed model reaches F1 values of 62.8% and 51.2% on the two datasets, which are 2.1 and 0.8 percentage points higher, respectively, than those of the MSVA model. The proposed model therefore performs better on the news short video key frame summarization task.

Key words: sound and picture relationship, voice-dominated theory, multimodal feature, semantic similarity, key frame summarization
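The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that the audio text summary and each video frame have already been projected into the common semantic space as plain vectors. The abstract does not give the exact loss form, so the classic margin-based pairwise contrastive loss is assumed here. The function names, the `margin` and `top_k` parameters, and the toy embeddings are all hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two vectors in the common semantic space."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(text_vec, frame_vec, is_match, margin=1.0):
    """Margin-based pairwise contrastive loss (assumed form, not confirmed by the source).

    Matching image-text pairs are pulled together; non-matching pairs are
    pushed apart until their distance is at least `margin`.
    """
    dist = math.dist(text_vec, frame_vec)
    if is_match:
        return dist ** 2
    return max(0.0, margin - dist) ** 2


def select_key_frames(text_vec, frame_vecs, top_k=2):
    """Return indices of the top_k frames most similar to the text summary, in temporal order."""
    scores = [cosine_similarity(text_vec, f) for f in frame_vecs]
    ranked = sorted(range(len(frame_vecs)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])


# Toy embeddings (hypothetical): one text summary vector and four frame vectors.
text_vec = [1.0, 0.0, 0.0]
frame_vecs = [
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.8, 0.0, 0.2],
    [0.0, 0.0, 1.0],
]
key_frames = select_key_frames(text_vec, frame_vecs, top_k=2)
print(key_frames)
```

In this toy example the first and third frames point in nearly the same direction as the text vector, so they are the ones selected. A real system would obtain the vectors from trained text and image encoders rather than hand-written lists.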

Xiaodan CUI, Dawei LIU, Yifan LIU, Zhibin ZHAO, Yougui REN, Yongming YAN. Research and Implementation of Key Frame Summarization Model for News Short Video[J]. Computer Engineering, 2023, 49(8): 182-189.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0065727

https://www.ecice06.com/EN/Y2023/V49/I8/182

Model	CCTV P (%)	CCTV R (%)	CCTV F₁ (%)	Bilibili P (%)	Bilibili R (%)	Bilibili F₁ (%)
vsLSTM	53.9	53.4	53.6	37.2	36.4	36.8
dppLSTM	54.8	53.6	54.2	38.7	37.6	38.1
SUM-GAN_sup	55.3	56.2	55.7	40.6	41.9	41.2
CSNet_sup	57.6	59.3	58.4	47.6	49.1	48.3
SASUM_sup	58.8	57.9	58.3	44.8	46.5	45.6
MSVA	61.4	60.1	60.7	49.7	51.1	50.4
AVRN	61.0	59.7	60.3	47.2	46.3	46.7
KFS4NSV	62.4	63.3	62.8	51.9	50.6	51.2
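
The relationship between the P, R, and F₁ columns can be checked with a short sketch. It assumes P and R denote precision and recall and that F₁ is their harmonic mean. The paper may instead average F₁ per video, so this is only a consistency check on the reported figures, not the authors' evaluation code. The helper name `f1_score` and the dictionary layout are illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs copied from the table for the two strongest models.
rows = {
    "MSVA": {"CCTV": (61.4, 60.1), "Bilibili": (49.7, 51.1)},
    "KFS4NSV": {"CCTV": (62.4, 63.3), "Bilibili": (51.9, 50.6)},
}

# F1 per model and dataset, rounded to one decimal as in the table.
f1 = {
    model: {ds: round(f1_score(p, r), 1) for ds, (p, r) in per_ds.items()}
    for model, per_ds in rows.items()
}

# Improvement of KFS4NSV over MSVA in percentage points.
gain = {ds: round(f1["KFS4NSV"][ds] - f1["MSVA"][ds], 1) for ds in ("CCTV", "Bilibili")}
print(f1, gain)
```

Under this assumption the recomputed values match the table to one decimal place, and the gains agree with the 2.1 and 0.8 percentage points stated in the abstract.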

Experiment No.	F	W	S	CCTV	Bilibili
1	√	×	×	53.2	36.5
2	√	√	×	55.4	41.8
3	√	√	√	62.8	51.2
