Computer Engineering, 2020, Vol. 46, Issue (6): 40-49. doi: 10.19678/j.issn.1000-3428.0056559

• Artificial Intelligence and Pattern Recognition •

Emotion Recognition Model for Children Speech Based on Improved LSTM

YU Liping1, LIANG Zhenlin2, LIANG Ruiyu3

  1. School of Computer Science and Technology, Fudan University, Shanghai 201203, China;
  2. School of Information Science and Engineering, Southeast University, Nanjing 210096, China;
  3. School of Information and Communications Engineering, Nanjing Institute of Technology, Nanjing 211167, China
  • Received: 2019-11-11 Revised: 2019-12-31 Published: 2020-01-09
  • About the authors: YU Liping (born 1994), female, master's student, whose main research interests are artificial intelligence and cognitive science; LIANG Zhenlin, master's student; LIANG Ruiyu, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61673108).

Abstract: To capture frame-level speech features effectively across children's different emotional need states, an emotion recognition model for children's speech based on an improved Long Short-Term Memory (LSTM) network is proposed. Frame-level speech features replace traditional statistical features so that the temporal relationships in the original speech are preserved. An attention mechanism is introduced to convert the traditional forget gate and input gate into an attention gate, and a deep attention gate is then computed according to a custom depth strategy, improving speech emotion recognition performance. Experimental results show that, on the FAU Aibo children's emotion corpus and an infant-cry emotional need database, the model improves recall and F1 score over a traditional LSTM-based recognition model by 3.14%, 5.50% and 1.84%, 5.49%, respectively. On the CASIA Chinese emotion database, it trains faster and achieves a higher recognition rate for children's speech emotion than recognition models based on traditional LSTM and GRU.

Key words: children's emotion, temporal relationship, frame-level speech feature, deep attention gate, Long Short-Term Memory (LSTM) network
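
The abstract does not give the gate equations, so the following is only a minimal sketch of one way a single attention gate could stand in for the separate forget and input gates of an LSTM cell. The gate formula (computed from the previous cell state alone), the coupled update `c = a * c_prev + (1 - a) * candidate`, and every name here (`attention_lstm_step`, `init_params`, `run_sequence`, `W_a`, `V_a`, and so on) are assumptions made for illustration, not the authors' formulation. The depth strategy behind the deep attention gate is not described in the abstract and is not shown.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(matrix, vector):
    # Plain matrix-vector product on nested lists.
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def init_params(n_in, n_hid, seed=0):
    # Small random weights; all parameter names are hypothetical.
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_a": mat(n_hid, n_hid),         # attention gate, inner projection
        "V_a": mat(n_hid, n_hid),         # attention gate, outer projection
        "W_c": mat(n_hid, n_in + n_hid),  # candidate cell state
        "b_c": [0.0] * n_hid,
        "W_o": mat(n_hid, n_in + n_hid),  # output gate
        "b_o": [0.0] * n_hid,
    }


def attention_lstm_step(x, h_prev, c_prev, p):
    """One time step of an assumed attention-gated LSTM cell."""
    # Assumed attention gate: a = sigmoid(V_a * tanh(W_a * c_prev)).
    inner = [math.tanh(s) for s in matvec(p["W_a"], c_prev)]
    a = [sigmoid(s) for s in matvec(p["V_a"], inner)]

    xh = x + h_prev  # concatenate the input frame and the previous hidden state
    cand = [math.tanh(s + b) for s, b in zip(matvec(p["W_c"], xh), p["b_c"])]
    o = [sigmoid(s + b) for s, b in zip(matvec(p["W_o"], xh), p["b_o"])]

    # The single gate a plays the forget role; (1 - a) plays the input role.
    c = [ai * cp + (1.0 - ai) * ci for ai, cp, ci in zip(a, c_prev, cand)]
    h = [oi * math.tanh(ci) for oi, ci in zip(o, c)]
    return h, c, a


def run_sequence(frames, n_hid, params):
    # Feed frame-level feature vectors one by one, preserving their order.
    h, c = [0.0] * n_hid, [0.0] * n_hid
    for frame in frames:
        h, c, _ = attention_lstm_step(frame, h, c, params)
    return h


if __name__ == "__main__":
    params = init_params(n_in=3, n_hid=4, seed=1)
    frames = [[0.1 * i, 0.0, -0.1 * i] for i in range(5)]  # toy frame features
    print(run_sequence(frames, 4, params))
```

Because the assumed gate depends only on the cell state, it needs fewer weight matrices than separate forget and input gates, which is one plausible route to the shorter training time the abstract reports; the paper itself should be consulted for the actual design.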
