
Computer Engineering ›› 2022, Vol. 48 ›› Issue (4): 113-118. doi: 10.19678/j.issn.1000-3428.0061076

• Artificial Intelligence and Pattern Recognition •


Speech Emotion Recognition Based on Heterogeneous Parallel Neural Network

ZHANG Huiyun1,2, HUANG Heming1,2   

  1. Computer College, Qinghai Normal University, Xining 810008, China;
    2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
  • Received: 2021-03-10  Revised: 2021-04-27  Published: 2021-05-07
  • About the authors: ZHANG Huiyun (b. 1993), female, Ph.D. candidate; her main research interests include pattern recognition and intelligent systems, and speech emotion recognition. HUANG Heming, professor, Ph.D.
  • Supported by: the National Natural Science Foundation of China (62066039).



Abstract: The core of a Speech Emotion Recognition (SER) system is to extract features that best represent speech emotion and to construct an acoustic model with strong robustness and generalization. In this study, a heterogeneous parallel neural network model based on the attention mechanism, AHPCL, is constructed for SER. A Long Short-Term Memory (LSTM) network is used to extract the time-series features of speech emotion, and convolution operations are used to extract the spatial spectral features of speech. By combining temporal and spatial information to jointly represent speech emotion, the accuracy of the prediction results is improved. The attention mechanism assigns weights according to the contribution of different time-series features to speech emotion, so as to select, from a large amount of feature information, the time series that best represents speech emotion. Low-level descriptor features such as pitch, Zero Crossing Rate (ZCR), and Mel-Frequency Cepstral Coefficients (MFCC) are extracted from three speech emotion databases, namely CASIA, EMODB, and SAVEE, and high-level statistical functions of these descriptors are computed, yielding 219-dimensional features that serve as the model input. The experimental results show that the proposed model achieves Unweighted Average Recall (UAR) of 86.02%, 84.03%, and 64.06% on the CASIA, EMODB, and SAVEE databases, respectively. Compared with the LeNet, DNN-ELM, and TSFFCNN baseline models, the AHPCL model exhibits stronger robustness and generalization.
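
To make the pipeline above concrete: frame-level Low-Level Descriptors (LLDs) are summarized by statistical functionals into one fixed-length vector per utterance. A minimal sketch follows, assuming librosa for the LLDs; the exact descriptor and functional inventory behind the 219 dimensions is not specified above, so the subset below is illustrative only and yields a different dimensionality:

```python
import numpy as np
import librosa

def extract_utterance_features(path, sr=16000):
    """Summarize frame-level LLDs (pitch, ZCR, MFCCs) with statistical
    functionals into one utterance-level vector. Illustrative subset,
    not the paper's exact 219-dimensional recipe."""
    y, sr = librosa.load(path, sr=sr)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)        # pitch contour
    zcr = librosa.feature.zero_crossing_rate(y)[0]       # per-frame ZCR
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # (13, n_frames)
    llds = [f0, zcr] + [mfcc[i] for i in range(13)]      # 15 LLD contours
    stats = []
    for contour in llds:
        # High-level statistical functionals over the frame axis.
        stats += [contour.mean(), contour.std(), contour.min(), contour.max()]
    return np.asarray(stats)   # 15 contours x 4 functionals = 60 dims here
```

The parallel model can be sketched in the same spirit. The convolutional branch, LSTM branch, and attention pooling follow the description above, but all layer sizes and the six-class output (as in CASIA) are assumptions, not the authors' AHPCL configuration:

```python
import torch
import torch.nn as nn

class AHPCLSketch(nn.Module):
    """Illustrative parallel CNN + LSTM model with attention pooling.
    Layer sizes and the 6-class output are assumptions."""

    def __init__(self, n_classes=6):
        super().__init__()
        # Convolutional branch: treat the 219-dim feature vector as a
        # 1-channel sequence and extract local spectral patterns.
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),               # -> (batch, 64, 1)
        )
        # Recurrent branch: LSTM over the same sequence, one value per step.
        self.lstm = nn.LSTM(input_size=1, hidden_size=64, batch_first=True)
        # Attention: score each LSTM step, softmax over time, weighted sum.
        self.att = nn.Linear(64, 1)
        self.classifier = nn.Linear(64 + 64, n_classes)

    def forward(self, x):                          # x: (batch, 219)
        conv_out = self.conv(x.unsqueeze(1)).squeeze(-1)   # (batch, 64)
        h, _ = self.lstm(x.unsqueeze(-1))          # (batch, 219, 64)
        w = torch.softmax(self.att(h), dim=1)      # (batch, 219, 1)
        att_out = (w * h).sum(dim=1)               # (batch, 64)
        return self.classifier(torch.cat([conv_out, att_out], dim=1))

model = AHPCLSketch()
logits = model(torch.randn(8, 219))                # -> (8, 6)
```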

Key words: Speech Emotion Recognition (SER), spectral feature, prosodic feature, attention mechanism, heterogeneous parallel branch, Recurrent Neural Network (RNN)
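
The figures reported above (86.02%, 84.03%, 64.06%) are Unweighted Average Recall (UAR): the mean of per-class recalls, which weights every emotion class equally regardless of its sample count. In scikit-learn terms this is macro-averaged recall; a minimal check on toy labels:

```python
from sklearn.metrics import recall_score

# Three emotion classes; class 0 and class 2 each have one miss.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR = {uar:.4f}")   # (0.5 + 1.0 + 0.5) / 3 = 0.6667
```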

CLC Number: