
计算机工程


Multimodal Sentiment Analysis with Private Feature Learning and Contrastive Learning

  • Published: 2025-07-31

Abstract: In multimodal sentiment analysis, traditional methods rely on directly fusing multimodal information, so each modality's private features are often overlooked during cross-modal interaction. This can reduce the model's accuracy and robustness when handling complex sentiment expressions. The problem is especially relevant in smart-education scenarios, where teachers must accurately judge students' learning states and emotional fluctuations from their speech, facial expressions, and textual feedback; improving the precision of multimodal sentiment analysis is therefore important for personalized teaching and classroom interaction. To address this issue, this study proposes a sentiment analysis model that combines private feature learning with contrastive learning. First, to fully exploit private features, the model compares the shared features with the original text, audio, and visual features by similarity, identifying the modality-specific information that cross-modal interaction neglects, and then fuses the private and shared features to strengthen the model's expressive capability. Second, a Modality-Agnostic Contrastive Loss (MACL) is proposed, which applies contrastive learning to the fused multimodal features, effectively exploiting the sentiment information in multimodal data, narrowing the gap between modalities, and yielding a unified sentiment representation. Experimental results on the CMU-MOSI and CMU-MOSEI datasets show that the model raises the F1 score to 85.98% and 85.95% and the binary classification accuracy to 86.01% and 85.97%, respectively, significantly outperforming the second-best models and validating the effectiveness of the proposed approach.