Computer Engineering ›› 2026, Vol. 52 ›› Issue (6): 258-267. doi: 10.19678/j.issn.1000-3428.0070508

• Multimodal and Information Fusion •

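Multimodal Sentiment Recognition Based on Cross-Modal Enhancement and Time-Step Gating

WANG Yongqi, WANG Lei*

  1. School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, Anhui, China
  • Received: 2024-10-21 Revised: 2024-12-30 Publication date: 2026-06-15 Release date: 2025-03-13
  • Corresponding author: WANG Lei
  • About the authors:

    WANG Yongqi, male, master's student; main research interest: multimodal sentiment recognition

    WANG Lei (corresponding author), associate professor

  • Funding:
    High-Tech Innovation Special Zone Project (20-163-14-LZ-001-004-01)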

Abstract:

Multimodal sentiment recognition aims to improve the accuracy and robustness of sentiment recognition by fusing information from different modalities, such as text, audio, and video. However, existing methods still fall short in handling the differences and complementarity between modalities and in capturing the dynamic characteristics of temporal sequences, which limits recognition performance. To address these problems, this paper proposes a multimodal sentiment recognition model based on cross-modal enhancement and a time-step gating mechanism. First, the model uses a cross-modal cross-attention mechanism to learn the correlations between modalities and strengthen the complementarity of their features. Through this cross-modal interaction, the model better integrates information from the text, audio, and video modalities and compensates for the limitations of any single modality in expressing sentiment. Next, a time-step gating mechanism dynamically adjusts the feature weights at each time step so that the model focuses on the time steps carrying the most important sentiment information, which improves its temporal sequence modeling capability. Finally, the fused features are fed into a classifier for sentiment prediction. In experiments on the public CMU-MOSEI and CMU-MOSI multimodal sentiment recognition datasets, the proposed model achieves sentiment recognition accuracies of 82.41% and 82.60%, respectively, a significant improvement over current mainstream models such as ALMT and TETFN. These results indicate that cross-modal enhancement and time-step gating effectively improve the model's multimodal feature fusion and temporal sequence processing, and they support the effectiveness and robustness of the method for multimodal sentiment recognition.

Key words: multimodal sentiment recognition, attention mechanism, gating mechanism, multi-task learning, multimodal fusion
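
The three stages named in the abstract (cross-modal cross-attention, time-step gating, and classification of the fused features) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract gives no architectural details, so everything below is an assumption made for illustration, including plain scaled dot-product attention without learned projections, enhancement of the text stream only, a single sigmoid gate per time step, mean pooling, a two-class linear classifier, random untrained weights, and all function names.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def softmax(values):
    peak = max(values)  # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query_seq, context_seq):
    """Each query time step attends over all context time steps.

    Assumption: scaled dot-product attention with no learned
    query/key/value projections.
    """
    scale = math.sqrt(len(query_seq[0]))
    attended = []
    for q in query_seq:
        weights = softmax([dot(q, c) / scale for c in context_seq])
        # Weighted sum of context vectors, one value per feature dimension.
        attended.append([dot(weights, col) for col in zip(*context_seq)])
    return attended


def enhance(target_seq, other_seqs):
    """Add cross-attended information from the other modalities (residual sum)."""
    enhanced = [list(step) for step in target_seq]
    for other in other_seqs:
        for step, extra in zip(enhanced, cross_attention(target_seq, other)):
            for i, value in enumerate(extra):
                step[i] += value
    return enhanced


def time_step_gate(seq, gate_w, gate_b):
    """Scale every time step by a sigmoid gate g_t = sigmoid(w . h_t + b)."""
    gates = [1.0 / (1.0 + math.exp(-(dot(gate_w, step) + gate_b))) for step in seq]
    gated = [[g * h for h in step] for g, step in zip(gates, seq)]
    return gated, gates


def classify(features, class_w, class_b):
    """Linear layer followed by softmax over sentiment classes."""
    return softmax([dot(w, features) + b for w, b in zip(class_w, class_b)])


rng = random.Random(0)  # fixed seed so the toy run is repeatable


def random_seq(steps, dim):
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(steps)]


# Hypothetical toy inputs: three modalities with different sequence lengths.
DIM = 4
text, audio, video = random_seq(5, DIM), random_seq(7, DIM), random_seq(6, DIM)

# Stage 1: enhance the text stream with audio and video via cross-attention.
enhanced = enhance(text, [audio, video])

# Stage 2: weight each time step with a gate (random, untrained parameters).
gate_w = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
gated, gates = time_step_gate(enhanced, gate_w, gate_b=0.0)

# Stage 3: mean-pool over time, then predict two classes (negative / positive).
pooled = [sum(col) / len(gated) for col in zip(*gated)]
class_w = random_seq(2, DIM)
probs = classify(pooled, class_w, class_b=[0.0, 0.0])

print("gates:", [round(g, 3) for g in gates])
print("class probabilities:", [round(p, 3) for p in probs])
```

Each gate lies between 0 and 1, so time steps with low gate values contribute less to the pooled representation. A trained model would learn the gate and classifier parameters, and would likely also apply the enhancement to the audio and video streams; the sketch keeps only the text stream to stay short.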