Hybrid Encoding and Fuzzy Modeling for Multimodal Emotion Recognition in Conversations

doi:10.19678/j.issn.1000-3428.0260120

Abstract

Multimodal emotion recognition in conversations fuses linguistic, acoustic, and visual information to identify the emotion of each utterance in a dialogue automatically, which makes human-computer interaction more natural and emotionally aware. Existing methods, however, have three weaknesses: they model the multi-level contextual dependencies of emotion inadequately, their modality fusion tends to introduce redundant information and noise, and they cannot represent the uncertainty of emotion, which limits the recognition of complex emotions. To address these problems, this paper proposes a multimodal conversational emotion recognition model that combines hybrid encoding with fuzzy modeling. A hybrid encoding module models the global dialogue context and local utterance-level dependencies at the same time, strengthening the representation of temporal emotional features. On top of this, a hierarchical gated fusion mechanism dynamically weights and fuses features from different levels and different modalities, suppressing redundancy and noise and improving the discriminability of the multimodal features. At the classification stage, a fuzzy neural network with linearly equally spaced initialization models the boundaries between emotion categories through fuzzy membership functions, capturing the uncertainty and fuzziness of emotional expression. Experimental results show that the proposed model outperforms the baseline methods on all metrics across the IEMOCAP, MELD, and CMU-MOSEI datasets. It reaches an accuracy of 72.67% on IEMOCAP and 67.37% on MELD, and 7-class and 2-class accuracies of 54.96% and 86.78% on CMU-MOSEI, which supports the effectiveness of the proposed method for multimodal sentiment analysis.
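The two mechanisms named in the abstract, gated fusion and fuzzy membership functions with linearly spaced initialization, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the membership function shape, the value range, or the gate design, so the Gaussian memberships, the [-1, 1] range, the shared width, the element-wise sigmoid gate, and all function names below are assumptions chosen for illustration.

```python
import math


def linspace(lo, hi, n):
    """Return n equally spaced values from lo to hi inclusive."""
    if n == 1:
        return [(lo + hi) / 2.0]
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def gaussian_memberships(x, centers, width):
    """Degree to which scalar x belongs to each fuzzy set, in (0, 1].

    Assumption: Gaussian membership functions with a shared width.
    """
    return [math.exp(-((x - c) ** 2) / (2.0 * width ** 2)) for c in centers]


def gated_fusion(a, b, gate_logits):
    """Element-wise gated sum g * a + (1 - g) * b with g = sigmoid(logit).

    Assumption: one gate per feature dimension; a real model would compute
    the logits with a learned layer instead of taking them as input.
    """
    fused = []
    for ai, bi, z in zip(a, b, gate_logits):
        g = 1.0 / (1.0 + math.exp(-z))
        fused.append(g * ai + (1.0 - g) * bi)
    return fused


# Hypothetical setup: 5 fuzzy sets whose centers start evenly spread
# over [-1, 1]; training would then adjust centers and widths.
centers = linspace(-1.0, 1.0, 5)
memberships = gaussian_memberships(0.5, centers, width=0.5)

# Hypothetical 3-dimensional text and audio features with fixed gate logits.
text_feat = [0.2, -0.4, 0.9]
audio_feat = [0.6, 0.1, -0.3]
fused = gated_fusion(text_feat, audio_feat, gate_logits=[0.0, 4.0, -4.0])
```

With equally spaced centers, every input in the range starts close to at least one fuzzy set, so no membership function begins training with near-zero activation everywhere. In the gate, a large positive logit keeps mostly the first modality, a large negative logit keeps mostly the second, and a logit of zero averages the two.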

ZHONG Hang, ZHANG Qinghua, LUO Nanfang, GUO Ruili. Hybrid Encoding and Fuzzy Modeling for Multimodal Emotion Recognition in Conversations[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260120.

