
Computer Engineering ›› 2025, Vol. 51 ›› Issue (4): 169-177. doi: 10.19678/j.issn.1000-3428.0069101

• Artificial Intelligence and Pattern Recognition •

Speech Emotion Recognition Based on Memory Capsules and Attention

DONG Hongliang, NIU Yan, SUN Yang*, LI Jun

  1. School of Computer Science, Hubei University of Technology, Wuhan 430068, Hubei, China
  • Received: 2023-12-26 Online: 2025-04-15 Published: 2024-05-30
  • Contact: SUN Yang

  • Supported by:
    National Natural Science Foundation of China (62202147)

Abstract:

Current speech emotion recognition systems lose accuracy because they extract emotional features insufficiently and model complex emotional expressions poorly. To improve recognition accuracy, this paper proposes a speech emotion recognition method based on memory capsules and attention. First, five speech features are extracted: Mel-Frequency Cepstral Coefficients (MFCC), Root Mean Square (RMS) energy, the Mel-spectrogram, Zero-Crossing Rate (ZCR), and the chroma distribution (CHROMA). Next, the first-, second-, and third-order difference (delta) features of the MFCC are computed and concatenated. Finally, these features are stacked into one-dimensional vectors and classified by a model built from memory capsules and an attention mechanism. Experimental results show that the proposed model generalizes well and is robust. It reaches accuracies of 95.87%, 98.82%, and 98.23% on the RAVDESS, EMODB, and IEMOCAP datasets, respectively, improving on existing methods.
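
The feature-stacking step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a plain backward difference for the first- to third-order MFCC dynamics, and it assumes that each feature is averaged over time before stacking; the abstract specifies neither. The MFCC matrix is replaced by a hypothetical toy matrix `mfcc_like`, because computing real MFCC, Mel-spectrogram, and chroma features requires a spectral front end that is omitted here. All function names are illustrative.

```python
import math

def frame_signal(signal, frame_len, hop):
    """Split a list of samples into overlapping frames (no padding)."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, hop)]

def rms(frame):
    """Root mean square energy of one frame."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)

def delta(track):
    """Backward difference over time; the first frame gets zeros.

    Assumption: a plain difference, not a smoothed regression delta.
    """
    out = [[0.0] * len(track[0])]
    for prev, cur in zip(track, track[1:]):
        out.append([c - p for p, c in zip(prev, cur)])
    return out

def time_mean(track):
    """Average each coefficient over all frames."""
    return [sum(col) / len(track) for col in zip(*track)]

def stack_features(mfcc_like, frames):
    """Build one 1-D vector: base coefficients, three delta orders, RMS, ZCR.

    Assumption: time-averaging before stacking is one common choice,
    not something the abstract states.
    """
    d1 = delta(mfcc_like)   # first-order dynamics
    d2 = delta(d1)          # second-order dynamics
    d3 = delta(d2)          # third-order dynamics
    vector = []
    for track in (mfcc_like, d1, d2, d3):
        vector.extend(time_mean(track))
    vector.append(sum(rms(f) for f in frames) / len(frames))
    vector.append(sum(zero_crossing_rate(f) for f in frames) / len(frames))
    return vector

# Toy inputs: an alternating signal and a hypothetical 3-frame, 2-coefficient matrix.
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
frames = frame_signal(signal, frame_len=4, hop=2)
mfcc_like = [[0.0, 0.0], [1.0, 2.0], [3.0, 6.0]]
feature_vector = stack_features(mfcc_like, frames)
```

With two coefficients per frame, the stacked vector holds eight coefficient means (base plus three delta orders) followed by the mean RMS and mean ZCR. In practice the Mel-spectrogram and chroma features would be appended in the same way before the vector is passed to the classifier.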

Key words: speech emotion recognition, feature extraction, feature stacking, memory capsule network, attention mechanism
