
Computer Engineering ›› 2025, Vol. 51 ›› Issue (11): 144-151. doi: 10.19678/j.issn.1000-3428.0069721

• Artificial Intelligence and Pattern Recognition •

  • Funding:
    Social Science Foundation of Xinjiang Uygur Autonomous Region (21BTQ162); Key Research and Development Program of Xinjiang Uygur Autonomous Region (2023B01032)

Multimodal Sentiment Analysis Based on Dense Co-Attention

ZHOU Shixiang1, YU Kai1,2,*

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, Xinjiang, China
    2. School of Public Administration, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China
  • Received: 2024-04-10 Revised: 2024-06-18 Online: 2025-11-15 Published: 2024-08-21
  • Contact: YU Kai


Abstract:

With the development of social networks, people increasingly express their emotions through multimodal data such as audio, text, and video. Traditional sentiment analysis methods struggle to process the emotional expressions in short videos effectively, and existing multimodal sentiment analysis techniques suffer from low accuracy and insufficient interaction between modalities. To address these problems, this study proposes a Multimodal Sentiment Analysis method based on Dense Co-Attention (DCA-MSA). First, it uses the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, the OpenFace 2.0 toolkit, and the COVAREP tool to extract text, video, and audio features, respectively. It then employs Bidirectional Long Short-Term Memory (BiLSTM) networks to model the temporal correlations within each feature sequence separately. Finally, it fuses the different features through a dense co-attention mechanism. The experimental results show that the proposed model is competitive with baseline models on multimodal sentiment analysis tasks: on the CMU-MOSEI dataset, binary classification accuracy improves by up to 3.7 percentage points and the F1 score by up to 3.1 percentage points; on the CH-SIMS dataset, binary classification accuracy improves by up to 4.1 percentage points, three-class classification accuracy by up to 2.8 percentage points, and the F1 score by up to 3.9 percentage points.
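The cross-modal fusion step described in the abstract can be illustrated roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: the single attention pass per direction, the concatenation-style "dense" fusion, and all tensor dimensions are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(X, Y, d_k):
    """One cross-attention pass: each time step of X attends over Y.

    X: (Tx, d) query modality; Y: (Ty, d) key/value modality.
    Returns (Tx, d) features of X enriched with information from Y.
    """
    scores = X @ Y.T / np.sqrt(d_k)   # (Tx, Ty) affinity matrix
    attn = softmax(scores, axis=-1)   # each row sums to 1
    return attn @ Y                   # weighted sum of Y's time steps

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))    # stand-in for BiLSTM text outputs, 5 tokens
audio = rng.standard_normal((7, d))   # stand-in for BiLSTM audio outputs, 7 frames

# co-attention runs in both directions: text -> audio and audio -> text
text_attended = co_attention(text, audio, d)
audio_attended = co_attention(audio, text, d)

# dense-style fusion: concatenate original and attended features
fused_text = np.concatenate([text, text_attended], axis=-1)    # (5, 2d)
fused_audio = np.concatenate([audio, audio_attended], axis=-1)  # (7, 2d)
print(fused_text.shape, fused_audio.shape)  # (5, 16) (7, 16)
```

In the full method, the same bidirectional attention would also pair text with video and audio with video, and the fused features would feed the sentiment classifier.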

Key words: multimodal, sentiment analysis, modal interaction, dense co-attention, feature fusion