
Computer Engineering ›› 2022, Vol. 48 ›› Issue (2): 281-290. doi: 10.19678/j.issn.1000-3428.0060270

• Development Research and Engineering Application •

Chinese Speech Emotion Recognition Method Using a Convolutional Neural Network Based on Improved Speech Processing

QIAO Dong1, CHEN Zhangjin1,2, DENG Liang1, TU Chengli1

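  1. Microelectronics Research and Development Center, Shanghai University, Shanghai 200444;
    2. Computing Centre, Shanghai University, Shanghai 200444
  • Received: 2020-12-14 Revised: 2021-02-01 Published: 2021-02-04
  • About the authors: QIAO Dong (born 1995), male, master's student, whose main research interests are speech emotion recognition and integrated circuit design; CHEN Zhangjin (corresponding author), professor, Ph.D.; DENG Liang and TU Chengli, master's students.
  • Funding:
    National Natural Science Foundation of China (61674100).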

Method for Chinese Speech Emotion Recognition Based on Improved Speech-Processing Convolutional Neural Network

QIAO Dong1, CHEN Zhangjin1,2, DENG Liang1, TU Chengli1   

  1. Microelectronics Research and Development Center, Shanghai University, Shanghai 200444, China;
    2. Computing Centre, Shanghai University, Shanghai 200444, China
  • Received: 2020-12-14 Revised: 2021-02-01 Published: 2021-02-04

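Abstract: Speech emotion recognition plays an important role in human-computer interaction. To address the low efficiency and accuracy of Chinese speech emotion recognition, a Chinese speech emotion recognition method based on the Trumpet-6 convolutional neural network model is proposed. During MFCC feature extraction, the number of sampling points used in the framing and windowing operation is increased, so that each Hamming window carries more features and fewer Hamming windows are needed. This shrinks the pixel dimensions of the MFCC feature map and improves the processing efficiency of a single recognition. On this basis, Gaussian white noise is used to augment the dataset, which alleviates overfitting during training. Experimental results on the CASIA speech emotion dataset show that the method reaches a test accuracy of 95.7%, outperforming traditional methods such as Lenet-5, RNN, and LSTM. The Trumpet-6 convolutional neural network model uses 2 048 sampling points and requires only 176 550 trainable parameters. Compared with ResNet34, which uses a DCNN, and with recurrent neural network models, it has fewer parameters, a simpler structure, and faster processing.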

Keywords: speech emotion recognition, MFCC features, Gaussian white noise, data augmentation, convolutional neural network

Abstract: Speech emotion recognition is essential in human-computer interaction. To address the low efficiency and accuracy of Chinese speech emotion recognition, this study proposes a Chinese speech emotion recognition method based on the Trumpet-6 convolutional neural network model. During Mel Frequency Cepstral Coefficient (MFCC) feature extraction, the number of sampling points used in the framing and windowing operation is increased, which adds features to each Hamming window and reduces the number of Hamming windows. This reduces the pixel size of the MFCC feature map and improves the processing efficiency of a single recognition. On this basis, Gaussian white noise is used to augment the dataset and mitigate overfitting during training. Experimental results on the CASIA speech emotion dataset show that the method achieves a test accuracy of 95.7%, which is better than that of traditional methods such as Lenet-5, Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM). The Trumpet-6 convolutional neural network model uses 2 048 sampling points and requires only 176 550 trainable parameters. Compared with ResNet34, which uses a Deep Convolutional Neural Network (DCNN), and with recurrent neural network models, it has fewer parameters, a simpler structure, and faster processing.

Key words: speech emotion recognition, MFCC feature, Gaussian white noise, data augmentation, Convolutional Neural Network (CNN)
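
The two preprocessing ideas named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The clip length, the 16 kHz sampling rate, the 50% frame overlap, the 20 dB signal-to-noise ratio, and the helper names `frame_count` and `add_white_noise` are all assumptions chosen for illustration, and the 2 048 sampling points are assumed here to be the frame length. The first function shows how a longer frame reduces the number of windows (and so the width of the feature map); the second adds zero-mean Gaussian white noise to a waveform to produce an augmented copy.

```python
import math
import random


def frame_count(num_samples: int, frame_len: int, hop_len: int) -> int:
    """Number of full frames when a window of frame_len samples slides
    over the signal in steps of hop_len samples (no padding)."""
    if num_samples < frame_len:
        return 0
    return 1 + (num_samples - frame_len) // hop_len


def add_white_noise(signal, snr_db, rng):
    """Return a copy of signal with zero-mean Gaussian white noise added
    at the requested signal-to-noise ratio (in dB)."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


# Hypothetical 2-second clip at an assumed 16 kHz sampling rate.
num_samples = 32000

# Longer frames give fewer windows, hence a narrower feature map.
for frame_len in (512, 1024, 2048):
    hop_len = frame_len // 2  # assumed 50% overlap, not taken from the paper
    print(frame_len, frame_count(num_samples, frame_len, hop_len))

# Augment a synthetic 440 Hz tone; real speech samples would be used instead.
rng = random.Random(0)
clean = [math.sin(2 * math.pi * 440 * n / 16000) for n in range(num_samples)]
noisy = add_white_noise(clean, snr_db=20, rng=rng)
```

Under these assumptions, moving from 512 to 2 048 samples per frame cuts the number of frames for the same clip roughly fourfold, which is the mechanism the abstract credits for the smaller MFCC feature map. The noise level used in the paper is not stated in the abstract, so `snr_db` should be treated as a free parameter.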

CLC number: