
Computer Engineering ›› 2023, Vol. 49 ›› Issue (7): 94-101. doi: 10.19678/j.issn.1000-3428.0064965

• Artificial Intelligence and Pattern Recognition •

Multi-modal Emotion Recognition Based on Dynamic Convolution and Residual Gating

Yanxia GUO1,2, Yong JIN1, Hong TANG1,2, Jinzhi PENG1,2   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Chongqing Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2022-06-10 Online: 2023-07-15 Published: 2023-07-14

  • About the authors:

    GUO Yanxia (b. 1995), female, M.S. candidate; her main research interest is emotion recognition

    JIN Yong, senior engineer, M.S.

    TANG Hong, professor, Ph.D.

    PENG Jinzhi, M.S. candidate

  • Funding:
    Program for Changjiang Scholars and Innovative Research Team in University (IRT_16R72)

Abstract:

To prevent important information carrying emotional cues in an utterance from being overwhelmed by irrelevant information, and to achieve multi-modal information interaction, a multi-modal emotion recognition model based on dynamic convolution and residual gating is proposed, which mines high-level local features and applies an effective interaction fusion strategy. Low-level features, high-level local features, and contextual dependencies are first extracted from the text, audio, and visual modalities. Cross-modal dynamic convolution is then used to model inter-modal and intra-modal interactions, simulating the interplay between long sequences in the time domain and capturing the interaction features of different modalities. A residual gated fusion method is designed to fuse the interaction representations of the different modalities; it automatically learns the weight with which each interaction feature influences the final output, and the fused multi-modal feature is fed into a classifier for emotion prediction. Experimental results show that the model prevents important emotional-cue information in multi-modal data from being overwhelmed by irrelevant information, achieving sentiment classification accuracies of 83.5% and 83.9% on the CMU-MOSEI and IEMOCAP datasets, respectively, and outperforming benchmark models such as the Multi-modal Transformer (MulT) and Multi-Fusion Residual Memory (MFRM).

Key words: natural language processing, information interaction, multi-modal emotion recognition, dynamic convolution, gating mechanism
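
The abstract names two mechanisms: cross-modal dynamic convolution for modeling interactions between long sequences in the time domain, and residual gated fusion for weighting the resulting interaction representations. The PyTorch sketch below is only a minimal illustration of how such mechanisms are commonly built; the class names (CrossModalDynamicConv, ResidualGatedFusion), the dimensions, and the exact kernel-generation and gating formulas are assumptions for demonstration, not the authors' published implementation.

```python
# Minimal, illustrative PyTorch sketch of the two mechanisms named in the abstract.
# NOT the authors' implementation: all module names, dimensions, and formulas
# here are assumptions made for demonstration only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    """Depth-wise 1-D convolution whose kernel is predicted from another modality.

    A "source" modality (e.g. audio) generates a per-sample, per-channel kernel
    that is convolved over the "target" modality sequence (e.g. text), so the
    interaction between the two modalities is modeled along the time axis.
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # Predict one depth-wise kernel per channel from the pooled source sequence.
        self.kernel_gen = nn.Linear(dim, dim * kernel_size)

    def forward(self, target: torch.Tensor, source: torch.Tensor) -> torch.Tensor:
        # target: (B, T, D) sequence to convolve; source: (B, S, D) conditioning sequence
        b, t, d = target.shape
        kernels = self.kernel_gen(source.mean(dim=1))                 # (B, D*K)
        kernels = F.softmax(kernels.view(b * d, 1, self.kernel_size), dim=-1)
        x = target.transpose(1, 2).reshape(1, b * d, t)               # grouped-conv trick
        out = F.conv1d(x, kernels, padding=self.kernel_size // 2, groups=b * d)
        return out.view(b, d, t).transpose(1, 2)                      # back to (B, T, D)


class ResidualGatedFusion(nn.Module):
    """Fuse several cross-modal interaction representations with learned gates.

    Each interaction representation (e.g. text-audio, text-visual, audio-visual)
    is scaled by a sigmoid gate computed from the representation itself, while a
    residual connection preserves the un-gated information.
    """

    def __init__(self, dim: int, num_interactions: int = 3):
        super().__init__()
        # One gate network per interaction representation (assumed design choice).
        self.gates = nn.ModuleList(
            [nn.Linear(dim, dim) for _ in range(num_interactions)]
        )
        self.proj = nn.Linear(num_interactions * dim, dim)

    def forward(self, interactions):
        gated = []
        for z, gate in zip(interactions, self.gates):
            g = torch.sigmoid(gate(z))   # learned influence weight per feature
            gated.append(g * z + z)      # gated path plus residual path
        return self.proj(torch.cat(gated, dim=-1))


if __name__ == "__main__":
    # Hypothetical shapes: 8 utterances, 20 text tokens, 50 audio frames, 128-d features.
    conv = CrossModalDynamicConv(dim=128)
    text = torch.randn(8, 20, 128)
    audio = torch.randn(8, 50, 128)
    text_audio = conv(text, audio)                     # (8, 20, 128)

    fusion = ResidualGatedFusion(dim=128, num_interactions=3)
    feats = [torch.randn(8, 128) for _ in range(3)]    # pooled interaction features
    fused = fusion(feats)
    print(text_audio.shape, fused.shape)               # (8, 20, 128) (8, 128)
```

In this sketch the gate is a per-feature sigmoid, so each interaction representation contributes through both a learned-weight path and an unmodified residual path, which corresponds to the abstract's stated goal of automatically learning how strongly each interaction feature influences the final prediction.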
