
Computer Engineering, 2023, Vol. 49, Issue (4): 125-130, 137. doi: 10.19678/j.issn.1000-3428.0064054

• Artificial Intelligence and Pattern Recognition •

Speech Emotion Recognition Based on Dynamic Convolution Recurrent Neural Network

GENG Lei1, FU Hongliang1, TAO Huawei1, LU Yuan1, GUO Xinying1, ZHAO Li2   

  1. Key Laboratory of Food Information Processing and Control, Ministry of Education, Henan University of Technology, Zhengzhou 450001, China;
    2. School of Information Science and Engineering, Southeast University, Nanjing 210096, China
  • Received: 2022-02-28  Revised: 2022-05-08  Published: 2023-04-07

  • About the authors: GENG Lei (born 1998), male, M.S. candidate; his research interests include speech emotion recognition, pattern recognition, and intelligent systems. FU Hongliang, professor, Ph.D. TAO Huawei (corresponding author), lecturer, Ph.D. LU Yuan, undergraduate student. GUO Xinying, associate professor, Ph.D. ZHAO Li, professor, Ph.D.
  • Funding: National Natural Science Foundation of China (61901159); Key Scientific Research Projects of Colleges and Universities in Henan Province (22A520004, 22A510001).

Abstract: Dynamic emotion features are important in speaker-independent speech emotion recognition. However, insufficient mining of the time-frequency information in speech limits the representation ability of existing dynamic emotion features. To better extract the dynamic emotion features in speech, this study proposes a dynamic convolution recurrent neural network model for speech emotion recognition. First, based on dynamic convolution theory, a dynamic convolutional neural network is constructed to extract the global dynamic emotion information in the spectrogram, and an attention mechanism strengthens the representation of the key emotional regions of the feature map along the time and frequency dimensions, respectively. Meanwhile, a Bi-directional Long Short-Term Memory (BiLSTM) network learns the spectrogram frame by frame to extract dynamic frame-level features and the temporal dependencies of emotion. Finally, a Maximum Density Divergence (MDD) loss aligns the features of new individuals with the feature distribution of the training set, which reduces the impact of individual differences on the feature distribution and improves the representation ability of the model. The experimental results show that the proposed model achieves weighted average accuracies of 59.50%, 88.01%, and 66.90% on the CASIA, Emo-db, and IEMOCAP databases, respectively. Compared with other mainstream models (HuWSF, CB-SER, RNN-Att, etc.), the recognition accuracy of the proposed model on the three databases improves by 1.25-16.00, 0.71-2.26, and 2.16-8.10 percentage points, respectively, which verifies the effectiveness of the proposed model.
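
As a rough illustration of the pipeline the abstract describes, the PyTorch sketch below combines a dynamic convolution layer, time- and frequency-axis attention, a BiLSTM branch over spectrogram frames, and a simplified distribution-alignment penalty. Every module name, shape, and hyperparameter here is an illustrative assumption, not the authors' published implementation; in particular, mdd_like_loss is only a crude stand-in for the Maximum Density Divergence loss.

    # Minimal sketch under the assumptions stated above (PyTorch).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DynamicConv2d(nn.Module):
        # Dynamic convolution: a softmax-weighted mixture of K parallel
        # kernels, with the mixing weights predicted from the input itself.
        def __init__(self, in_ch, out_ch, k=3, num_kernels=4):
            super().__init__()
            self.out_ch, self.k = out_ch, k
            self.weight = nn.Parameter(
                0.02 * torch.randn(num_kernels, out_ch, in_ch, k, k))
            self.gate = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_ch, num_kernels))

        def forward(self, x):                          # x: (B, C, H, W)
            b, c, h, w = x.shape
            alpha = F.softmax(self.gate(x), dim=1)     # (B, K) per-sample weights
            w_mix = torch.einsum('bk,koihw->boihw', alpha, self.weight)
            w_mix = w_mix.reshape(b * self.out_ch, c, self.k, self.k)
            # Run the per-sample mixed kernels as one grouped convolution.
            out = F.conv2d(x.reshape(1, b * c, h, w), w_mix,
                           padding=self.k // 2, groups=b)
            return out.reshape(b, self.out_ch, h, w)

    class TimeFreqAttention(nn.Module):
        # Re-weights a (B, C, F, T) feature map along the frequency and
        # time axes with two independent attention vectors.
        def forward(self, x):
            freq_att = torch.sigmoid(x.mean(dim=(1, 3), keepdim=True))  # (B,1,F,1)
            time_att = torch.sigmoid(x.mean(dim=(1, 2), keepdim=True))  # (B,1,1,T)
            return x * freq_att * time_att

    def mdd_like_loss(feat_a, feat_b):
        # Stand-in for the MDD loss: pull the two feature sets together
        # (cross term) while keeping each set dense (within term). The
        # published MDD formulation is more elaborate than this.
        cross = torch.cdist(feat_a, feat_b).pow(2).mean()
        within = (torch.cdist(feat_a, feat_a).pow(2).mean()
                  + torch.cdist(feat_b, feat_b).pow(2).mean())
        return cross + 0.5 * within

    class DCRNN(nn.Module):
        # Dynamic-convolution branch (global time-frequency features)
        # fused with a BiLSTM branch (frame-level temporal features).
        def __init__(self, n_mels=64, n_classes=6, hidden=128):
            super().__init__()
            self.block1 = nn.Sequential(DynamicConv2d(1, 32),
                                        nn.BatchNorm2d(32), nn.ReLU(),
                                        nn.MaxPool2d(2))
            self.block2 = nn.Sequential(DynamicConv2d(32, 64),
                                        nn.BatchNorm2d(64), nn.ReLU())
            self.att = TimeFreqAttention()
            self.pool = nn.AdaptiveAvgPool2d((4, 4))
            self.lstm = nn.LSTM(n_mels, hidden, batch_first=True,
                                bidirectional=True)
            self.fc = nn.Linear(64 * 16 + 2 * hidden, n_classes)

        def forward(self, spec):                       # spec: (B, 1, F, T)
            g = self.pool(self.att(self.block2(self.block1(spec))))
            frames = spec.squeeze(1).transpose(1, 2)   # (B, T, F) frame sequence
            _, (h_n, _) = self.lstm(frames)
            feats = torch.cat([g.flatten(1), h_n[0], h_n[1]], dim=1)
            return self.fc(feats), feats               # logits + fused features

    model = DCRNN()
    spec = torch.randn(8, 1, 64, 100)                  # fake log-Mel batch
    logits, feats = model(spec)                        # feats feed mdd_like_loss

In this sketch the fused feature vector is returned alongside the logits so that mdd_like_loss can be applied between training-set and new-speaker features, mirroring the distribution-alignment step the abstract describes.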

Key words: speech emotion recognition, feature extraction, dynamic feature, attention mechanism, neural network

