
Computer Engineering, 2022, Vol. 48, Issue (7): 59-65. doi: 10.19678/j.issn.1000-3428.0061981

• Artificial Intelligence and Pattern Recognition •

Research on Schema Multi-Domain Speech Emotion Recognition Based on Multi-Operation Network

ZHANG Huiyun1,2,3,4, HUANG Heming1,2,3,4

  1. School of Computer Science, Qinghai Normal University, Xining 810008, China;
  2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China;
  3. Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining 810008, China;
  4. Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810008, China
  • Received: 2021-07-05 Revised: 2021-08-25 Online: 2022-07-15 Published: 2021-08-30
  • About the authors: ZHANG Huiyun (born 1993), female, Ph.D. candidate; main research interests are pattern recognition, intelligent systems, and speech emotion recognition. HUANG Heming (corresponding author), professor, Ph.D.
  • Supported by:
    National Natural Science Foundation of China (62066039).

Abstract: Research on multi-domain Speech Emotion Recognition (SER) is complicated by the fact that the available speech corpora differ in annotation method, recording scenario, and interaction mode, which makes a multi-domain SER system difficult to build. This paper proposes a multi-domain SER model based on a multi-operation network. Three single-domain databases, CASIA, EMODB, and SAVEE, are combined for the first time to construct four multi-domain speech emotion databases, Hybrid-CE, Hybrid-ES, Hybrid-CS, and Hybrid-CES, together with a Hierarchical Multi-operation Network (HMN). HMN consists of two heterogeneous parallel branches. The left branch consists of two isomorphic parallel one-dimensional convolutional layers, each with 128 neurons. The right branch consists of a Bi-GRU layer and a Bi-LSTM layer in parallel, each with 64 memory units. The original data are projected into different transform spaces for computation, so that the emotional information in speech is represented more accurately. The features extracted by the left and right branches are fused multiple times through the hierarchical operations Concatenate, Add, and Multiply. On this basis, high-level statistical functions of low-level descriptors such as Mel Frequency Cepstral Coefficients (MFCC), chromagram, and spectral contrast are computed, yielding 219-dimensional features that serve as the input to HMN. Experimental results show that the model achieves F1-scores of 82.22%, 65.02%, 70.59%, and 73.47% on the four multi-domain databases, respectively, indicating good robustness and generalization.
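
The three fusion operations named in the abstract can be illustrated with a minimal, dependency-free sketch. The abstract does not state the order in which Concatenate, Add, and Multiply are applied, so the two-level layout below (`fuse_pair`, then `fuse_branches`) is a hypothetical arrangement chosen for illustration, not the authors' architecture; the toy vectors stand in for the outputs of the four parallel layers.

```python
from typing import List

Vector = List[float]


def concat(a: Vector, b: Vector) -> Vector:
    """Concatenate: join two feature vectors end to end."""
    return a + b


def add(a: Vector, b: Vector) -> Vector:
    """Add: element-wise sum of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("Add needs vectors of equal length")
    return [x + y for x, y in zip(a, b)]


def multiply(a: Vector, b: Vector) -> Vector:
    """Multiply: element-wise product of two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("Multiply needs vectors of equal length")
    return [x * y for x, y in zip(a, b)]


def fuse_pair(u: Vector, v: Vector) -> Vector:
    """Hypothetical first fusion level: merge two parallel layer outputs.

    Assumption: the Add and Multiply results are concatenated. The abstract
    names the operations but not how they are ordered.
    """
    return concat(add(u, v), multiply(u, v))


def fuse_branches(left: Vector, right: Vector) -> Vector:
    """Hypothetical second fusion level: join the left and right branches."""
    return concat(left, right)


# Toy stand-ins for the outputs of the four parallel layers.
conv_a, conv_b = [1.0, 2.0], [3.0, 4.0]  # two Conv1D outputs (left branch)
gru, lstm = [0.5, 0.5], [2.0, 4.0]       # Bi-GRU and Bi-LSTM outputs (right branch)

left = fuse_pair(conv_a, conv_b)
right = fuse_pair(gru, lstm)
fused = fuse_branches(left, right)
print(fused)
```

Add and Multiply require both inputs to have the same width, which is one plausible reason for giving the paired layers matching sizes; Concatenate has no such constraint and simply widens the representation.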

Key words: Speech Emotion Recognition (SER), prosodic feature, spectral feature, multi-feature fusion, multi-operation network
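
The abstract describes the model input as statistical functions computed over low-level descriptors. The sketch below shows that general idea with the standard library only. The choice of statistics (mean, standard deviation, minimum, maximum) and the helper names `functionals` and `utterance_vector` are assumptions for illustration; the abstract does not list which functions produce the 219 dimensions, and the toy frames are not real MFCC or chroma values.

```python
from statistics import mean, pstdev
from typing import Dict, List

Frames = List[List[float]]  # one row per frame, one column per coefficient


def functionals(frames: Frames) -> List[float]:
    """Collapse frame-level descriptors into one fixed-length vector.

    Assumption: four statistics per coefficient (mean, population standard
    deviation, min, max). The paper's actual set is not given in the abstract.
    """
    if not frames:
        raise ValueError("need at least one frame")
    stats: List[float] = []
    for column in zip(*frames):  # iterate over coefficients, not frames
        stats.extend([mean(column), pstdev(column), min(column), max(column)])
    return stats


def utterance_vector(descriptors: Dict[str, Frames]) -> List[float]:
    """Concatenate the functionals of every descriptor in a fixed name order."""
    vector: List[float] = []
    for name in sorted(descriptors):
        vector.extend(functionals(descriptors[name]))
    return vector


# Toy descriptors: 2 frames each, with 2 "MFCC" and 1 "chroma" coefficient.
toy = {
    "mfcc": [[1.0, 2.0], [3.0, 4.0]],
    "chroma": [[0.0], [1.0]],
}
vec = utterance_vector(toy)
print(len(vec), vec)
```

Because each coefficient contributes a fixed number of statistics, the resulting vector has the same length for every utterance regardless of its duration, which is what lets a fixed-size network input be built from variable-length speech.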

CLC Number: