
Computer Engineering, 2023, Vol. 49, Issue (4): 125-130, 137. doi: 10.19678/j.issn.1000-3428.0064054

• Artificial Intelligence and Pattern Recognition •

Speech Emotion Recognition Based on Dynamic Convolution Recurrent Neural Network

GENG Lei1, FU Hongliang1, TAO Huawei1, LU Yuan1, GUO Xinying1, ZHAO Li2   

  1. Key Laboratory of Food Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China;
    2. School of Information Science and Engineering, Southeast University, Nanjing 210096, China
  • Received: 2022-02-28  Revised: 2022-05-08  Published: 2023-04-07

  • About the authors: GENG Lei (born 1998), male, M.S. candidate; his research interests include speech emotion recognition, pattern recognition, and intelligent systems. FU Hongliang, professor, Ph.D. TAO Huawei (corresponding author), lecturer, Ph.D. LU Yuan, undergraduate student. GUO Xinying, associate professor, Ph.D. ZHAO Li, professor, Ph.D.
  • Funding: National Natural Science Foundation of China (61901159); Key Scientific Research Projects of Colleges and Universities in Henan Province (22A520004, 22A510001).

Abstract: Dynamic emotion features are important in speaker-independent speech emotion recognition. However, insufficient mining of the time-frequency information in speech limits the representation ability of existing dynamic emotion features. To better extract the dynamic emotion features in speech, this study proposes a dynamic convolution recurrent neural network model for speech emotion recognition. First, based on dynamic convolution theory, a dynamic convolutional neural network is constructed to extract the global dynamic emotion information in the spectrogram, and an attention mechanism strengthens the representation of the key emotional regions of the feature map along the time and frequency dimensions, respectively. Meanwhile, a Bi-directional Long Short-Term Memory (BiLSTM) network learns the spectrogram frame by frame to extract dynamic frame-level features and the temporal dependencies of emotion. Finally, a Maximum Density Divergence (MDD) loss aligns the features of new individuals with the feature distribution of the training set, which reduces the impact of individual differences on the feature distribution and improves the representation ability of the model. The experimental results show that the proposed model achieves weighted average accuracies of 59.50%, 88.01%, and 66.90% on the CASIA, Emo-db, and IEMOCAP databases, respectively. Compared with other mainstream models (HuWSF, CB-SER, RNN-Att, etc.), the recognition accuracy of the proposed model on the three databases improves by 1.25-16.00, 0.71-2.26, and 2.16-8.10 percentage points, respectively, which verifies the effectiveness of the proposed model.
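
As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below combines a dynamic convolution layer, time- and frequency-axis attention, a BiLSTM branch over spectrogram frames, and a simplified distribution-alignment penalty. Every module name, shape, and hyperparameter here is an illustrative assumption, not the authors' published implementation; in particular, mdd_like_loss is only a crude stand-in for the Maximum Density Divergence loss.

    # Minimal sketch under the assumptions stated above (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv2d(nn.Module):
        # Dynamic convolution: a softmax-weighted mixture of K parallel
        # kernels, with the mixing weights predicted from the input itself.
        def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
            super().__init__()
            self.out_ch, self.k = out_ch, k
            self.weight = nn.Parameter(
                0.02 * torch.randn(num_kernels, out_ch, in_ch, k, k))
            self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, num_kernels))

        def forward(self, x):                          # x: (B, C, H, W)
            b, c, h, w = x.shape
            alpha = F.softmax(self.gate(x), dim=1)     # (B, K) per-sample weights
            w_mix = torch.einsum('bk,koihw->boihw', alpha, self.weight)
            w_mix = w_mix.reshape(b * self.out_ch, c, self.k, self.k)
            # Run the per-sample mixed kernels as one grouped convolution.
            out = F.conv2d(x.reshape(1, b * c, h, w), w_mix,
                           padding=self.k // 2, groups=b)
            return out.reshape(b, self.out_ch, h, w)

    class TimeFreqAttention(nn.Module):
        # Re-weights a (B, C, F, T) feature map along the frequency and
        # time axes with two independent attention vectors.
        def forward(self, x):
            freq_att = torch.sigmoid(x.mean(dim=(1, 3), keepdim=True))  # (B,1,F,1)
            time_att = torch.sigmoid(x.mean(dim=(1, 2), keepdim=True))  # (B,1,1,T)
            return x * freq_att * time_att

    def mdd_like_loss(feat_a, feat_b):
        # Stand-in for the MDD loss: pull the two feature sets together
        # (cross term) while keeping each set dense (within term). The
        # published MDD formulation is more elaborate than this.
        cross = torch.cdist(feat_a, feat_b).pow(2).mean()
        within = (torch.cdist(feat_a, feat_a).pow(2).mean()
                  + torch.cdist(feat_b, feat_b).pow(2).mean())
        return cross + 0.5 * within

    class DCRNN(nn.Module):
        # Dynamic-convolution branch (global time-frequency features)
        # fused with a BiLSTM branch (frame-level temporal features).
        def __init__(self, n_mels=64, n_classes=6, hidden=128):
            super().__init__()
            self.block1 = nn.Sequential(DynamicConv2d(1, 32),
                                        nn.BatchNorm2d(32), nn.ReLU(),
                                        nn.MaxPool2d(2))
            self.block2 = nn.Sequential(DynamicConv2d(32, 64),
                                        nn.BatchNorm2d(64), nn.ReLU())
            self.att = TimeFreqAttention()
            self.pool = nn.AdaptiveAvgPool2d((4, 4))
            self.lstm = nn.LSTM(n_mels, hidden, batch_first=True,
                                bidirectional=True)
            self.fc = nn.Linear(64 * 16 + 2 * hidden, n_classes)

        def forward(self, spec):                       # spec: (B, 1, F, T)
            g = self.pool(self.att(self.block2(self.block1(spec))))
            frames = spec.squeeze(1).transpose(1, 2)   # (B, T, F) frame sequence
            _, (h_n, _) = self.lstm(frames)
            feats = torch.cat([g.flatten(1), h_n[0], h_n[1]], dim=1)
            return self.fc(feats), feats               # logits + fused features

    model = DCRNN()
    spec = torch.randn(8, 1, 64, 100)                  # fake log-Mel batch
    logits, feats = model(spec)                        # feats feed mdd_like_loss

In this sketch the fused feature vector is returned alongside the logits so that mdd_like_loss can be applied between training-set and new-speaker features, mirroring the distribution-alignment step the abstract describes.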

Key words: speech emotion recognition, feature extraction, dynamic feature, attention mechanism, neural network

