Multimodal Sentiment Analysis Based on Cross-Modal Enhancement and Time Step Gating

doi:10.19678/j.issn.1000-3428.0070508

Abstract

Abstract: Multimodal sentiment analysis aims to improve the accuracy and robustness of sentiment prediction by fusing information from different modalities, such as text, audio, and video. Existing methods, however, still handle the differences and complementarity between modalities poorly and struggle to capture the dynamic features of temporal sequences, which degrades recognition performance. To address these issues, this paper proposes a multimodal sentiment analysis model based on cross-modal enhancement and a time-step gating mechanism. First, a cross-modal cross-attention mechanism learns the correlations between modalities and enhances the complementarity of their features. Through this interaction, the model better integrates information from the text, audio, and video modalities and compensates for the limitations of any single modality in expressing sentiment. Next, a time-step gating mechanism dynamically adjusts the feature weight of each time step so that the model focuses on the time steps that carry the most sentiment-relevant information, which improves its temporal sequence modeling. Finally, the fused features are fed into a classifier for sentiment prediction. In experiments on the public CMU-MOSEI and CMU-MOSI multimodal sentiment analysis datasets, the proposed model achieves accuracies of 82.41% and 82.6%, respectively, significantly outperforming current mainstream models such as ALMT and TETFN. These results show that cross-modal enhancement and time-step gating improve the model's multimodal feature fusion and temporal sequence processing, and they support the effectiveness and robustness of the method on multimodal sentiment analysis tasks.
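The two mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract gives no equations, so everything below is an assumption made for illustration: the function names `cross_modal_attention` and `time_step_gate` are hypothetical, the attention is plain scaled dot-product attention without the learned query/key/value projections a real model would use, the gate is a single scalar sigmoid per time step, and the inputs are random numbers rather than real features. It is not the authors' implementation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(query_seq, context_seq):
    """Enhance query_seq (T_q x d) with information from context_seq (T_c x d).

    Assumption: identity Q/K/V projections and a residual connection.
    """
    d = len(query_seq[0])
    enhanced = []
    for q in query_seq:
        # Similarity between this query step and every context step.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in context_seq]
        weights = softmax(scores)
        # Attention-weighted sum of the context steps.
        attended = [sum(w * k[j] for w, k in zip(weights, context_seq))
                    for j in range(d)]
        # Residual connection keeps the original modality's information.
        enhanced.append([qi + ai for qi, ai in zip(q, attended)])
    return enhanced


def time_step_gate(seq, gate_w, gate_b):
    """Compute one sigmoid gate per time step and a gate-weighted mean.

    Assumption: the gate is sigmoid(w . x_t + b) with hypothetical
    parameters gate_w and gate_b that would be learned in practice.
    """
    gates = []
    for step in seq:
        z = sum(w * x for w, x in zip(gate_w, step)) + gate_b
        gates.append(1.0 / (1.0 + math.exp(-z)))
    total = sum(gates)
    d = len(seq[0])
    pooled = [sum(g * step[j] for g, step in zip(gates, seq)) / total
              for j in range(d)]
    return gates, pooled


def rand_seq(num_steps, dim):
    """Random stand-in for a modality's feature sequence."""
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_steps)]


random.seed(0)
text, audio, video = rand_seq(5, 4), rand_seq(7, 4), rand_seq(6, 4)

# Text attends to audio, then to video (the ordering is an assumption).
text_enh = cross_modal_attention(cross_modal_attention(text, audio), video)

gate_w = [random.gauss(0, 0.1) for _ in range(4)]
gates, pooled = time_step_gate(text_enh, gate_w, 0.0)
print(len(text_enh), len(text_enh[0]), len(pooled))
```

In this sketch the sequences may have different lengths, since attention only requires a shared feature dimension. The pooled vector stands in for the fused representation that the abstract says is passed to a classifier.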

WANG Yongqi, WANG Lei. Multimodal Sentiment Analysis Based on Cross-Modal Enhancement and Time Step Gating[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0070508.
