
Computer Engineering (计算机工程)



Multimodal Sentiment Analysis Based on Cross-Modal Enhancement and Time Step Gating

Published: 2025-03-13


Abstract: Multimodal sentiment analysis aims to improve the accuracy and robustness of sentiment recognition by fusing information from different modalities such as text, audio, and video. However, existing methods still fall short in handling the discrepancies and complementarity between modalities and in capturing the dynamic characteristics of temporal sequences, often resulting in suboptimal performance. To address these issues, this paper proposes a multimodal sentiment analysis model based on cross-modal enhancement and a time-step gating mechanism. First, the model employs a cross-modal cross-attention mechanism to learn the correlations between modalities, enhancing the complementarity of their features. Through these cross-modal interactions, the model better integrates information from the text, audio, and video modalities and mitigates the limitations of any single modality in expressing sentiment. Next, a time-step gating mechanism dynamically adjusts the feature weights at each time step, focusing on the time steps that carry the most relevant sentiment information and thereby improving the model's temporal sequence modeling ability. Finally, the fused features are fed into a classifier for sentiment prediction. Experiments on the publicly available CMU-MOSEI and CMU-MOSI multimodal sentiment analysis datasets show that the proposed model achieves accuracies of 82.41% and 82.6%, respectively, significantly outperforming current mainstream models such as ALMT and TETFN. These results demonstrate that the cross-modal enhancement and time-step gating mechanisms effectively improve the model's multimodal feature fusion and temporal sequence processing, validating the method's effectiveness and robustness for multimodal sentiment analysis tasks.
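To make the described architecture more concrete, the following is a minimal PyTorch sketch of the two mechanisms named in the abstract: cross-modal enhancement via cross-attention, and time-step gating over the fused sequence before classification. The paper's implementation is not reproduced here, so the module names (CrossModalEnhancement, TimeStepGate, FusionClassifier), the shared feature dimension, the residual wiring, and the mean-pooling readout are illustrative assumptions rather than the authors' exact design.

# Illustrative sketch only: the paper does not publish code here, so module names,
# dimensions, and the overall wiring below are assumptions for exposition.
import torch
import torch.nn as nn


class CrossModalEnhancement(nn.Module):
    """Enhance a target modality by cross-attending to a source modality."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target attends to source; the residual connection keeps the original signal.
        enhanced, _ = self.attn(query=target, key=source, value=source)
        return self.norm(target + enhanced)


class TimeStepGate(nn.Module):
    """Assign a learned weight in [0, 1] to each time step and reweight the sequence."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, time, dim); weights: (batch, time, 1)
        weights = self.gate(seq)
        return seq * weights


class FusionClassifier(nn.Module):
    """Toy pipeline: text enhanced by audio and video, gated over time, then classified."""

    def __init__(self, dim: int = 128, num_classes: int = 2):
        super().__init__()
        self.text_from_audio = CrossModalEnhancement(dim)
        self.text_from_video = CrossModalEnhancement(dim)
        self.gate = TimeStepGate(dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, text, audio, video):
        # All inputs are assumed pre-projected to a shared shape (batch, time, dim).
        fused = self.text_from_audio(text, audio) + self.text_from_video(text, video)
        gated = self.gate(fused)
        pooled = gated.mean(dim=1)  # average over time steps
        return self.classifier(pooled)


if __name__ == "__main__":
    batch, steps, dim = 2, 50, 128
    model = FusionClassifier(dim=dim)
    logits = model(torch.randn(batch, steps, dim),
                   torch.randn(batch, steps, dim),
                   torch.randn(batch, steps, dim))
    print(logits.shape)  # torch.Size([2, 2])

In this sketch the sigmoid gate attenuates time steps judged uninformative rather than discarding them, which is one plausible reading of "dynamically adjusting feature weights at each time step"; the paper's actual gating formulation may differ.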