Computer Engineering

Multimodal Sentiment Analysis via Generative Completion and Dynamic Knowledge Fusion

  • Published: 2025-08-13

Abstract: Multimodal sentiment analysis uses multimodal data to infer human emotions. However, the performance of existing models degrades markedly in the presence of missing modality information, over-reliance on the text modality, and cross-modal conflicts. To address these issues, a multimodal sentiment analysis model based on generative completion and dynamic knowledge fusion (Generative Completion and Dynamic Knowledge Fusion Model, GC-DKF) is proposed. First, a generative prompt learning module completes the missing intra-modal and inter-modal information in the raw data, generating the absent modality features and improving the model's adaptability to uncertain-modality scenarios. Second, a dynamic dominant-modality selection mechanism chooses the dominant modality according to a sentiment proportion factor, while a knowledge encoder strengthens the representation of each individual modality, yielding knowledge-enhanced representations for every modality. Finally, guided by the dominant-modality features, the model further learns from the remaining secondary modalities to produce a complementary joint multimodal representation, enabling more efficient and accurate multimodal sentiment analysis. Experiments on the public CMU-MOSI and CMU-MOSEI datasets show that the proposed model outperforms mainstream multimodal sentiment recognition methods on binary classification accuracy, F1 score, mean absolute error, and Pearson correlation coefficient, with sentiment recognition accuracies of 83.55% and 83.02%, respectively. These results demonstrate that the proposed model is strongly competitive on multimodal sentiment recognition tasks.
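
To make the three-stage pipeline concrete, the following is a minimal PyTorch sketch of the flow the abstract describes: prompt-based completion of a missing modality, dominant-modality selection via a sentiment proportion factor, and dominant-guided fusion over the secondary modalities. This is not the authors' implementation: the class name GCDKFSketch, all dimensions, the use of learnable prompt vectors as a stand-in for the generative completion module, and the |score|-share form of the sentiment proportion factor are illustrative assumptions.

```python
# Minimal sketch of the GC-DKF pipeline described in the abstract.
# NOT the authors' code: module structure, dimensions, and the form of the
# sentiment proportion factor are assumptions for illustration only.
import torch
import torch.nn as nn

D = 128  # shared hidden size (assumption)

class GCDKFSketch(nn.Module):
    def __init__(self, dim=D, n_modalities=3):
        super().__init__()
        # Learnable prompt vectors stand in for the generative prompt
        # learning module: they replace the features of a missing modality.
        self.prompts = nn.Parameter(torch.randn(n_modalities, dim))
        # One "knowledge encoder" per modality (plain MLP placeholder here).
        self.encoders = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(n_modalities)]
        )
        # Per-modality sentiment heads used to form the proportion factor.
        self.sent_heads = nn.ModuleList(
            [nn.Linear(dim, 1) for _ in range(n_modalities)]
        )
        # Cross-attention: the dominant modality queries all modalities.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.regressor = nn.Linear(dim, 1)  # sentiment intensity output

    def forward(self, feats, present):
        """feats: list of (B, D) per-modality features; present: list of bools."""
        B = next(f.size(0) for f, p in zip(feats, present) if p)
        # 1) Generative completion: missing modalities get learned prompts.
        completed = [
            f if p else self.prompts[i].expand(B, -1)
            for i, (f, p) in enumerate(zip(feats, present))
        ]
        # 2) Knowledge-enhanced unimodal representations.
        enhanced = [enc(x) for enc, x in zip(self.encoders, completed)]
        # Sentiment proportion factor: each modality's share of the total
        # |sentiment evidence| (one plausible reading of the abstract).
        scores = torch.stack(
            [head(h).abs().squeeze(-1) for head, h in zip(self.sent_heads, enhanced)],
            dim=1,
        )  # (B, M)
        proportion = scores / scores.sum(dim=1, keepdim=True).clamp_min(1e-8)
        dominant = proportion.argmax(dim=1)  # (B,) index of dominant modality
        # 3) Dominant-guided fusion: dominant features attend over all modalities.
        stack = torch.stack(enhanced, dim=1)            # (B, M, D)
        query = stack[torch.arange(B), dominant].unsqueeze(1)  # (B, 1, D)
        fused, _ = self.cross_attn(query, stack, stack)
        return self.regressor(fused.squeeze(1))         # (B, 1) sentiment score

# Usage: text and vision present, audio missing (filled by its prompt vector).
model = GCDKFSketch()
feats = [torch.randn(2, D), torch.zeros(2, D), torch.randn(2, D)]
out = model(feats, present=[True, False, True])
print(out.shape)  # torch.Size([2, 1])
```

The per-sample argmax over the proportion factor mirrors the "dynamic" aspect of dominant-modality selection: different samples in a batch can be driven by different modalities, while cross-attention lets the chosen modality pull in complementary information from the secondary ones.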