
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 122-128. doi: 10.19678/j.issn.1000-3428.0064430

• Artificial Intelligence and Pattern Recognition •

Lightweight Speech Emotion Recognition Model Based on Multi-Task Learning

SONG Yukai, XIE Jiang

  1. School of Computer Engineering and Science, Shanghai University, Shanghai 200444, China
  • Received: 2022-04-11 Revised: 2022-06-13 Published: 2022-08-19
  • About the authors: SONG Yukai (1996-), male, M.S. candidate; research interests include machine learning and data science. XIE Jiang, professor, Ph.D.



Abstract: Current Speech Emotion Recognition (SER) models suffer from large numbers of training parameters, poor generalization, and low emotion recognition accuracy. Under the condition of limited speech emotion data, it is therefore particularly important to build a lightweight model that improves recognition efficiency and accuracy. To this end, this paper proposes a lightweight, end-to-end, multi-task deep learning model named P-CNN+Gender, which is composed of three parts: a speech feature combination network, a body convolutional network responsible for emotion and gender feature extraction, and emotion and gender classifiers. The model takes the Mel-Frequency Cepstral Coefficients (MFCC) of speech as input; the feature combination network uses convolutional kernels of different sizes to extract features from the MFCC in parallel and combines them for the subsequent body convolutional network, which extracts emotion and gender features. Considering the correlation between emotional expression and gender, gender classification is integrated into emotion classification as an auxiliary task to improve the model's emotion classification performance. The model is tested on the IEMOCAP, Emo-DB, and CASIA speech emotion datasets and achieves Unweighted Accuracy (UA) of 73.3%, 96.4%, and 93.9%, respectively, which is 3.0, 5.8, and 6.5 percentage points higher than the P-CNN model. Its number of training parameters is only 1/10 to 1/2 that of other models such as 3D-ACRNN and CNNBiRNN, while it achieves faster processing and higher accuracy.
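The architecture described in the abstract — parallel convolutions of different widths over MFCC features, a shared body network, and two task heads (emotion plus an auxiliary gender head) trained with a joint loss — can be sketched in plain NumPy. This is a minimal illustrative sketch, not the authors' implementation: the kernel widths, layer sizes, class counts, labels, and the auxiliary-loss weight `lam` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_pool(x, kernel):
    """Valid 1-D convolution of each MFCC row with a kernel, then global max-pool over time."""
    w = kernel.shape[0]
    T = x.shape[1] - w + 1
    out = np.array([[x[c, t:t + w] @ kernel for t in range(T)]
                    for c in range(x.shape[0])])
    return out.max(axis=1)                        # -> (n_mfcc,)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Toy MFCC input: 13 coefficients x 100 frames (random placeholder values).
mfcc = rng.standard_normal((13, 100))

# Feature-combination stage: parallel kernels of different widths, concatenated.
kernels = [rng.standard_normal(w) * 0.1 for w in (3, 5, 7)]
features = np.concatenate([conv1d_pool(mfcc, k) for k in kernels])   # (39,)

# Shared body network followed by two task heads
# (emotion: 4 classes, gender: 2 classes -- illustrative sizes).
W_body = rng.standard_normal((64, features.size)) * 0.1
h = np.maximum(W_body @ features, 0.0)            # ReLU shared representation
p_emotion = softmax(rng.standard_normal((4, 64)) * 0.1 @ h)
p_gender = softmax(rng.standard_normal((2, 64)) * 0.1 @ h)

# Multi-task objective: emotion loss plus a weighted auxiliary gender loss.
y_emotion, y_gender, lam = 2, 1, 0.5              # hypothetical labels and weight
loss = -np.log(p_emotion[y_emotion]) - lam * np.log(p_gender[y_gender])
```

In a real trained model the weights would of course be learned by backpropagating this joint loss, so gradients from the gender task regularize the shared body network — which is the mechanism by which the auxiliary task improves emotion classification.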

Key words: Speech Emotion Recognition (SER), Mel-Frequency Cepstral Coefficients (MFCC) features, feature extraction, convolutional network, deep learning
