Abstract: Speech emotion recognition plays an important role in human-computer interaction. To address the low efficiency and accuracy of Chinese speech emotion recognition, this study proposes a recognition method based on the Trumpet-6 convolutional neural network model. During Mel Frequency Cepstral Coefficient (MFCC) feature extraction, the number of sampling points used in the framing and windowing operation is increased, so that each Hamming window carries more features and fewer Hamming windows are needed. This reduces the pixel size of the MFCC feature map and improves the processing efficiency of a single recognition. On this basis, Gaussian white noise is used to augment the data set, which mitigates overfitting during training. Experimental results on the CASIA speech emotion data set show that the method reaches a test accuracy of 95.7%, higher than that of traditional methods such as LeNet-5, Recurrent Neural Network (RNN), and Long Short-Term Memory (LSTM). The Trumpet-6 convolutional neural network model uses 2 048 sampling points and has only 176 550 trainable parameters. Compared with the DCNN-based ResNet34 and with recurrent neural network models, it has fewer parameters, a simpler structure, and faster processing.
Keywords: speech emotion recognition, white Gaussian noise, data set augmentation, Convolutional Neural Network (CNN)
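The two preprocessing ideas summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the 16 kHz sample rate, the 50% frame overlap, the 20 dB signal-to-noise ratio, and the helper names `frame_count` and `add_white_noise` are all assumptions chosen for illustration. The first helper shows why a longer frame (more sampling points per Hamming window) yields fewer frames, and hence a narrower MFCC feature map; the second shows one common way to add zero-mean Gaussian white noise for data augmentation.

```python
import math
import random


def frame_count(num_samples: int, frame_length: int, hop_length: int) -> int:
    """Number of full frames obtained from a signal without padding."""
    if num_samples < frame_length:
        return 0
    return 1 + (num_samples - frame_length) // hop_length


def add_white_noise(signal, snr_db, rng):
    """Return a copy of `signal` with zero-mean Gaussian noise at `snr_db` dB."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = signal_power / (10 ** (snr_db / 10))
    sigma = math.sqrt(noise_power)
    return [x + rng.gauss(0.0, sigma) for x in signal]


if __name__ == "__main__":
    sample_rate = 16000          # assumed sample rate, not stated in the abstract
    num_samples = 2 * sample_rate  # hypothetical two-second utterance

    # Longer frames give fewer frames, so the feature map has fewer columns.
    for frame_length in (512, 1024, 2048):
        hop_length = frame_length // 2  # assumed 50% overlap
        frames = frame_count(num_samples, frame_length, hop_length)
        print(f"frame_length={frame_length}: {frames} frames")

    # Augment a synthetic 440 Hz tone with white noise at an assumed 20 dB SNR.
    tone = [math.sin(2 * math.pi * 440 * t / sample_rate)
            for t in range(num_samples)]
    noisy = add_white_noise(tone, snr_db=20.0, rng=random.Random(0))
    print(f"augmented clip length: {len(noisy)} samples")
```

In practice each augmented waveform would be passed through the same MFCC extraction as the clean one, so the noisy copies enlarge the training set without changing the feature-map size.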