
Computer Engineering ›› 2023, Vol. 49 ›› Issue (8): 182-189. doi: 10.19678/j.issn.1000-3428.0065727

• Graphics and Image Processing •

Research and Implementation of Key Frame Summarization Model for News Short Video

Xiaodan CUI1, Dawei LIU1, Yifan LIU1, Zhibin ZHAO1,*, Yougui REN1,2, Yongming YAN3

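  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
  2. Service Center of Natural Resource Affairs of Liaoning Province, Shenyang 110001, China
  3. Shenyang Dixin Artificial Intelligence Industry Research Institute Co., Ltd., Shenyang 110136, China
  • Received: 2022-09-13 Published: 2023-08-15 Online: 2022-12-09
  • Corresponding author: Zhibin ZHAO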
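  • About the authors:

    Xiaodan CUI (born 1998), female, master's student; main research interests: computer vision and machine learning

    Dawei LIU, master's student

    Yifan LIU, master's student

    Yougui REN, doctoral student

    Yongming YAN, Ph.D.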

Research and Implementation of Key Frame Summarization Model for News Short Video

Xiaodan CUI1, Dawei LIU1, Yifan LIU1, Zhibin ZHAO1,*, Yougui REN1,2, Yongming YAN3   

  1. School of Computer Science and Engineering, Northeastern University, Shenyang 110169, China
    2. Service Center of Natural Resource Affairs of Liaoning Province, Shenyang 110001, China
    3. Shenyang Dixin Artificial Intelligence Industry Research Institute Co., Ltd., Shenyang 110136, China
  • Received: 2022-09-13 Published: 2023-08-15 Online: 2022-12-09
  • Contact: Zhibin ZHAO

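Abstract:

According to the "sound and picture relationship" theory in communication studies, news short videos convey their content directly and effectively through audio and are typical "voice-dominated" videos. Existing video summarization techniques ignore the influence of the sound and picture relationship on how video content is presented, so their performance is unstable on summarization tasks for specific types of short video. To address the voice-dominated nature of news short videos, a key frame summarization model for news short videos based on the semantic similarity of multimodal features is proposed. Unlike traditional fusion models, the model first extracts multimodal features, then constructs a common semantic space and jointly trains image-text pairs by minimizing a contrastive loss function, which yields a cross-modal measure of semantic similarity between the audio text summary and the video frames. In the summary generation task, the model focuses on image content that is consistent with the semantic information described in the audio and uses that information to select relevant key frames, producing a more accurate short video summary. The experimental dataset was built from 450 CCTV news short videos and 385 Bilibili self-media news short videos, and the F1 value was used to measure the performance of different models. The experimental results show that the model achieves F1 values of 62.8% and 51.2% on the two datasets, improvements of 2.1 and 0.8 percentage points over the MSVA model, indicating better performance on the key frame summarization task for news short videos.

Key words: sound and picture relationship, voice-dominated theory, multimodal feature, semantic similarity, key frame summarization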

Abstract:

According to the "sound and picture relationship" theory of communication, news short videos convey their content directly and effectively through audio and are therefore typical "voice-dominated" videos. Existing video summarization technologies ignore the influence of the sound and picture relationship on how video content is presented, which leads to unstable performance on specific types of short video summarization. To address the voice-dominated nature of news short videos, this paper proposes a Key Frame Summarization model for News Short Video (KFS4NSV) based on the semantic similarity of multimodal features. In contrast to traditional fusion models, the proposed model extracts multimodal features, constructs a common semantic space on top of them, and jointly trains image-text pairs by minimizing a contrastive loss function, thereby obtaining a cross-modal semantic similarity metric between the audio text summary and the video frames. In the summary generation task, the model focuses on image content that is consistent with the semantic information in the audio and uses that information to select relevant key frames, yielding a more accurate short video summary. The experimental datasets consist of 450 CCTV news short videos and 385 Bilibili self-media news short videos. The F1 value is used to measure the performance of different models, and the experimental results show that the proposed model reaches F1 values of 62.8% and 51.2% on the two datasets, which are 2.1 and 0.8 percentage points higher, respectively, than those obtained using the MSVA model. The proposed model therefore performs better on the news short video key frame summarization task.

Key words: sound and picture relationship, voice-dominated theory, multimodal feature, semantic similarity, key frame summarization
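
The selection and evaluation steps described in the abstract can be illustrated with a minimal sketch. It assumes that the audio text summary and each video frame have already been embedded into the common semantic space; the function names, the use of cosine similarity, the top-k selection rule, and the toy vectors are all illustrative assumptions, not details taken from the paper. The F1 computation shown is the generic set-overlap form and may differ from the evaluation protocol the authors actually use.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_key_frames(text_embedding, frame_embeddings, k):
    """Return the indices of the k frames most similar to the text embedding.

    Hypothetical helper: the selection rule (plain top-k) is an assumption.
    Ties are broken by the lower frame index; the result is in temporal order.
    """
    scored = [
        (cosine_similarity(text_embedding, frame), index)
        for index, frame in enumerate(frame_embeddings)
    ]
    top = sorted(scored, key=lambda pair: (-pair[0], pair[1]))[:k]
    return sorted(index for _, index in top)


def f1_score(selected, reference):
    """F1 between selected and reference key frame indices (set overlap)."""
    selected, reference = set(selected), set(reference)
    overlap = len(selected & reference)
    if overlap == 0:
        return 0.0
    precision = overlap / len(selected)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


# Toy 2-D embeddings standing in for points in the common semantic space.
text_embedding = [1.0, 0.0]
frame_embeddings = [[0.9, 0.1], [0.0, 1.0], [1.0, 0.0], [-1.0, 0.0]]

selected = select_key_frames(text_embedding, frame_embeddings, k=2)
score = f1_score(selected, reference=[2, 3])
print(selected, score)
```

In this toy case frames 0 and 2 point in nearly the same direction as the text embedding, so they are the ones selected, and one of them matches the reference set. In practice the embeddings would come from the jointly trained image and text encoders rather than hand-written vectors.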