
Computer Engineering ›› 2025, Vol. 51 ›› Issue (11): 144-151. doi: 10.19678/j.issn.1000-3428.0069721

• Artificial Intelligence and Pattern Recognition •

  • Funding:
    Social Science Foundation of Xinjiang Uygur Autonomous Region (21BTQ162); Key Research and Development Program of Xinjiang Uygur Autonomous Region (2023B01032)

Multimodal Sentiment Analysis Based on Dense Co-Attention

ZHOU Shixiang1, YU Kai1,2,*

  1. College of Computer Science and Technology, Xinjiang University, Urumqi 830017, Xinjiang, China
    2. School of Public Administration, Xinjiang University of Finance and Economics, Urumqi 830012, Xinjiang, China
  • Received: 2024-04-10 Revised: 2024-06-18 Online: 2025-11-15 Published: 2024-08-21
  • Contact: YU Kai


Abstract:

With the development of social networks, people increasingly express their emotions through multimodal data such as audio, text, and video. Traditional sentiment analysis methods struggle to process the emotional expressions in short videos effectively, and existing multimodal sentiment analysis techniques suffer from low accuracy and insufficient interaction between modalities. To address these problems, this study proposes a Multimodal Sentiment Analysis method based on Dense Co-Attention (DCA-MSA). First, it uses the pre-trained Bidirectional Encoder Representations from Transformers (BERT) model, the OpenFace 2.0 toolkit, and the COVAREP tool to extract text, video, and audio features, respectively. It then employs Bidirectional Long Short-Term Memory (BiLSTM) networks to model the temporal correlations within each feature sequence separately. Finally, it fuses the different features through a dense co-attention mechanism. The experimental results show that the proposed model is competitive with baseline models on multimodal sentiment analysis tasks: on the CMU-MOSEI dataset, binary classification accuracy improves by up to 3.7 percentage points and the F1 score by up to 3.1 percentage points; on the CH-SIMS dataset, binary classification accuracy improves by up to 4.1 percentage points, three-class classification accuracy by up to 2.8 percentage points, and the F1 score by up to 3.9 percentage points.
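The cross-modal fusion step described in the abstract can be illustrated roughly as follows. This is a minimal NumPy sketch, not the authors' implementation: the single attention pass per direction, the concatenation-style "dense" fusion, and all tensor dimensions are assumptions for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def co_attention(X, Y, d_k):
    """One cross-attention pass: each time step of X attends over Y.

    X: (Tx, d) query modality; Y: (Ty, d) key/value modality.
    Returns (Tx, d) features of X enriched with information from Y.
    """
    scores = X @ Y.T / np.sqrt(d_k)   # (Tx, Ty) affinity matrix
    attn = softmax(scores, axis=-1)   # each row sums to 1
    return attn @ Y                   # weighted sum of Y's time steps

rng = np.random.default_rng(0)
d = 8
text = rng.standard_normal((5, d))    # stand-in for BiLSTM text outputs, 5 tokens
audio = rng.standard_normal((7, d))   # stand-in for BiLSTM audio outputs, 7 frames

# co-attention runs in both directions: text -> audio and audio -> text
text_attended = co_attention(text, audio, d)
audio_attended = co_attention(audio, text, d)

# dense-style fusion: concatenate original and attended features
fused_text = np.concatenate([text, text_attended], axis=-1)    # (5, 2d)
fused_audio = np.concatenate([audio, audio_attended], axis=-1)  # (7, 2d)
print(fused_text.shape, fused_audio.shape)  # (5, 16) (7, 16)
```

In the full method, the same bidirectional attention would also pair text with video and audio with video, and the fused features would feed the sentiment classifier.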

Key words: multimodal, sentiment analysis, modal interaction, dense co-attention, feature fusion