Research on Schema Multi-Domain Speech Emotion Recognition Based on Multi-Operation Network

doi:10.19678/j.issn.1000-3428.0061981

Computer Engineering, 2022, Vol. 48, Issue (7): 59-65. doi: 10.19678/j.issn.1000-3428.0061981

ZHANG Huiyun^1,2,3,4, HUANG Heming^1,2,3,4

1. School of Computer Science, Qinghai Normal University, Xining 810008, China;
2. The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China;
3. Key Laboratory of Tibetan Information Processing, Ministry of Education, Xining 810008, China;
4. Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province, Xining 810008, China

Received: 2021-07-05 Revised: 2021-08-25 Online: 2022-07-15 Published: 2021-08-30
About the authors: ZHANG Huiyun (born 1993), female, Ph.D. candidate; her main research interests are pattern recognition, intelligent systems, and speech emotion recognition. HUANG Heming (corresponding author), professor, Ph.D.
Funding: National Natural Science Foundation of China (62066039).

Abstract

Abstract: Available speech corpora differ in annotation method, recording scenario, and interaction mode, which makes building a multi-domain Speech Emotion Recognition (SER) system complex. This paper proposes a multi-domain SER model based on a multi-operation network. Three single-domain databases, CASIA, EMODB, and SAVEE, are combined for the first time to construct four multi-domain speech emotion databases, Hybrid-CE, Hybrid-ES, Hybrid-CS, and Hybrid-CES, and a Hierarchical Multi-operation Network (HMN) is designed. HMN consists of two heterogeneous parallel branches. The left branch consists of two isomorphic parallel one-dimensional convolutional layers, each with 128 neurons. The right branch consists of a Bi-GRU layer and a Bi-LSTM layer in parallel, each with 64 memory units. The original data are projected into different transform spaces for computation, so that the emotional information in speech is represented more accurately. The features extracted by the left and right branches are fused multiple times through the hierarchical operations Concatenate, Add, and Multiply. On this basis, high-level statistical functions of low-level descriptors such as Mel Frequency Cepstral Coefficients (MFCC), chromagram, and spectral contrast are computed, yielding 219-dimensional features as the input of HMN. Experimental results show that the model achieves F1-scores of 82.22%, 65.02%, 70.59%, and 73.47% on the four multi-domain databases, respectively, indicating good robustness and generalization.
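
The hierarchical fusion step can be illustrated with a minimal sketch. The abstract names the three operations but does not say how they are wired together, so the two-level arrangement below (fusion inside each branch, then across branches), the `fuse` helper, and the 128-dimensional layer outputs are all assumptions made for illustration. Random numbers stand in for learned features.

```python
import random


def concatenate(*vectors):
    """Join feature vectors end to end."""
    return [x for v in vectors for x in v]


def add(u, v):
    """Element-wise sum; both vectors must have the same length."""
    if len(u) != len(v):
        raise ValueError("Add needs equal-length vectors")
    return [a + b for a, b in zip(u, v)]


def multiply(u, v):
    """Element-wise product; both vectors must have the same length."""
    if len(u) != len(v):
        raise ValueError("Multiply needs equal-length vectors")
    return [a * b for a, b in zip(u, v)]


def fuse(u, v):
    """Assumed wiring: apply all three operations to one pair and join the results."""
    return concatenate(u, v, add(u, v), multiply(u, v))


def random_features(rng, dim):
    """Stand-in for the output of a trained layer (hypothetical values)."""
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


rng = random.Random(0)
DIM = 128  # assumed size of each layer's pooled output

conv_a, conv_b = random_features(rng, DIM), random_features(rng, DIM)
gru_out, lstm_out = random_features(rng, DIM), random_features(rng, DIM)

left = fuse(conv_a, conv_b)      # level 1, left branch: 4 * 128 = 512 values
right = fuse(gru_out, lstm_out)  # level 1, right branch: 512 values
fused = fuse(left, right)        # level 2, across branches: 4 * 512 = 2048 values

print(len(left), len(right), len(fused))
```

Add and Multiply are element-wise, so they only work when both inputs have the same length, whereas Concatenate accepts any lengths. In this sketch that constraint is met by giving every layer output the same assumed size.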

Key words: Speech Emotion Recognition (SER), prosodic feature, spectral feature, multi-feature fusion, multi-operation network
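
To illustrate how statistical functions turn variable-length low-level descriptors into a fixed-length model input, here is a minimal sketch. The four functionals (mean, standard deviation, minimum, maximum), the helper names, and the toy frame values are assumptions; the abstract does not list which functionals produce the 219 dimensions. In practice the frame-level descriptors could come from an audio library such as librosa.

```python
import statistics


def functionals(track):
    """Summarise one descriptor's frame-by-frame trajectory with four statistics."""
    return [
        statistics.fmean(track),   # mean over frames
        statistics.pstdev(track),  # population standard deviation over frames
        min(track),
        max(track),
    ]


def utterance_vector(frames):
    """Map a list of frames (n_frames x n_descriptors) to one fixed-length vector."""
    vector = []
    for track in zip(*frames):  # transpose: one trajectory per descriptor
        vector.extend(functionals(list(track)))
    return vector


# Toy utterance: 3 frames, 3 descriptors per frame (hypothetical values).
frames = [
    [0.0, 1.0, 2.0],
    [2.0, 1.0, 4.0],
    [4.0, 1.0, 6.0],
]
vector = utterance_vector(frames)

# A longer utterance with the same descriptors yields a vector of the same length.
longer = utterance_vector(frames + [[6.0, 1.0, 8.0], [8.0, 1.0, 10.0]])

print(len(vector), len(longer))
```

The output length depends only on the number of descriptors and functionals, not on the utterance duration, which is what allows utterances of different lengths to share one network input size.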

CLC number: TP183

ZHANG Huiyun, HUANG Heming. Research on Schema Multi-Domain Speech Emotion Recognition Based on Multi-Operation Network[J]. Computer Engineering, 2022, 48(7): 59-65.

https://www.ecice06.com/CN/Y2022/V48/I7/59

Figures/Tables: 10



