
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 218-227. doi: 10.19678/j.issn.1000-3428.0067874

• Graphics and Image Processing •

Multimodal Sentiment Analysis for Video Data

WU Xing1, YIN Haoyu1, YAO Junfeng2, LI Weimin1, QIAN Quan1   

  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China;
    2. CSSC Seago System Technology Co., Ltd., Shanghai 200010, China
  • Received: 2023-06-16  Revised: 2023-09-05  Published: 2023-10-30
  • Corresponding author: WU Xing, E-mail: xingwu@shu.edu.cn
  • Supported by: Key Program of the National Natural Science Foundation of China (61936001); Shanghai Rising-Star Program (21QB1401900)

Abstract: Multimodal sentiment analysis aims to extract and integrate semantic information from text, image, and audio data to identify the emotional states of speakers in online videos. Although multimodal fusion methods have achieved promising results in this research area, previous studies have not adequately addressed the distribution differences between modalities or the fusion of relational knowledge. Therefore, this study proposes a multimodal sentiment analysis method built around a Multimodal Prompt Gate (MPG) module. The module converts nonverbal information into prompts fused with the textual context, using the text to filter the noise in nonverbal signals and producing prompts rich in semantic information that enhance information integration between modalities. In addition, an instance-to-label contrastive learning framework is proposed to distinguish different labels in the latent space at the semantic level and further optimize the model output. Experiments on three large-scale sentiment analysis datasets show that the proposed method improves binary classification accuracy by approximately 0.7% over the second-best model and ternary classification accuracy by more than 2.5%, reaching 0.671. The method can serve as a reference for introducing multimodal sentiment analysis into fields such as user profiling, video understanding, and AI interviews.
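The abstract describes the MPG module and the instance-to-label contrastive objective only at a high level. The PyTorch sketch below illustrates how a text-conditioned prompt gate and a label-level contrastive loss of this kind could be wired together; it is a minimal illustration under assumed shapes and names (MultimodalPromptGate, instance_to_label_contrastive, n_prompts, tau), not the authors' implementation.

```python
# Hypothetical sketch of the two ideas named in the abstract, not the paper's code:
# (1) a prompt gate that filters nonverbal (audio/visual) features with textual context,
# (2) an instance-to-label contrastive loss over learnable label embeddings.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultimodalPromptGate(nn.Module):
    """Turn pooled audio/visual features into text-conditioned prompt vectors."""

    def __init__(self, text_dim: int, av_dim: int, n_prompts: int = 4):
        super().__init__()
        self.proj = nn.Linear(av_dim, text_dim)        # map A/V features into the text space
        self.gate = nn.Linear(text_dim * 2, text_dim)  # gate conditioned on the text context
        self.to_prompts = nn.Linear(text_dim, text_dim * n_prompts)
        self.n_prompts = n_prompts

    def forward(self, text_ctx: torch.Tensor, av_feat: torch.Tensor) -> torch.Tensor:
        # text_ctx: (B, text_dim) pooled text representation
        # av_feat:  (B, av_dim)   pooled audio or visual representation
        av = self.proj(av_feat)
        g = torch.sigmoid(self.gate(torch.cat([text_ctx, av], dim=-1)))  # text acts as a noise filter
        fused = g * av + (1.0 - g) * text_ctx                            # filtered nonverbal + text context
        # (B, n_prompts, text_dim) prompt vectors to prepend to the text encoder input
        return self.to_prompts(fused).view(-1, self.n_prompts, text_ctx.size(-1))


def instance_to_label_contrastive(feats: torch.Tensor, labels: torch.Tensor,
                                  label_emb: torch.Tensor, tau: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each instance toward its own label embedding."""
    feats = F.normalize(feats, dim=-1)        # (B, D) instance representations
    protos = F.normalize(label_emb, dim=-1)   # (C, D) one embedding per sentiment label
    logits = feats @ protos.t() / tau         # similarity of each instance to every label
    return F.cross_entropy(logits, labels)    # positive = the instance's own label


# Toy usage with assumed dimensions (e.g. 768-d text features, 74-d acoustic features):
gate = MultimodalPromptGate(text_dim=768, av_dim=74)
prompts = gate(torch.randn(8, 768), torch.randn(8, 74))                 # (8, 4, 768)
loss = instance_to_label_contrastive(torch.randn(8, 768),
                                     torch.randint(0, 3, (8,)),
                                     torch.randn(3, 768))
```

The gating form is one plausible reading of "using text information to filter the noise of nonverbal signals": when the gate closes, the prompt degenerates to the textual context, so noisy audio or visual input cannot dominate the fused representation.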

Key words: multimodal sentiment analysis, semantic information, multimodal fusion, contextual representation, contrastive learning

CLC Number: