
Computer Engineering ›› 2025, Vol. 51 ›› Issue (4): 97-106. doi: 10.19678/j.issn.1000-3428.0069185

• Artificial Intelligence and Pattern Recognition •

Multi-Feature Speech Emotion Recognition Based on Improved Efficient Channel Attention Mechanism

DU Chenyang, ZHANG Xueying, HUANG Lixia*, LI Juan

  1. College of Electronic Information and Optical Engineering, Taiyuan University of Technology, Taiyuan 030024, Shanxi, China
  • Received: 2024-01-08 Online: 2025-04-15 Published: 2024-05-29
  • Contact: HUANG Lixia
  • Supported by: National Natural Science Foundation of China (62271342)

Abstract:

Attention mechanisms are widely used in Speech Emotion Recognition (SER); however, traditional attention modules improve model performance at the cost of a substantially larger parameter count. The Efficient Channel Attention (ECA) mechanism requires few parameters but can only generate attention weights along the channel dimension. To address this limitation, an Improved ECA (IECA) module is proposed. With a relatively small number of parameters, the IECA module generates corresponding weights for each dimension of the input feature map, enabling the model to focus on and exploit the important information within it. Additionally, to further improve the recognition rate, spectrogram features and IS10 features are extracted from the speech separately, and a fusion network combines the predictions of the different branches at the decision level to produce the final prediction. The proposed model achieves Weighted Accuracies (WA) of 91.63% and 92.46% and Unweighted Average Recalls (UAR) of 91.25% and 92.33% on the EMODB and CASIA datasets, respectively, improvements of 2.69-8.43 and 4.16-10.69 percentage points over previously reported results.
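
The abstract describes the IECA module only at a high level: ECA-style attention that keeps the parameter count small but extends the weighting from the channel axis to every axis of the feature map. The PyTorch sketch below is a hypothetical illustration of that idea, not the paper's exact design: it reuses ECA's parameter-free global pooling followed by a small 1-D convolution, applied once per axis of a (batch, channel, frequency, time) spectrogram feature map. All class and variable names are our own.

```python
# Minimal sketch of ECA-style attention extended beyond the channel axis.
# The exact IECA design is not given in the abstract; the per-axis extension
# below is a hypothetical illustration, not the paper's implementation.
import torch
import torch.nn as nn


class ECA1D(nn.Module):
    """Standard ECA idea: attention weights along one axis via a k-sized 1-D conv."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size, padding=kernel_size // 2, bias=False)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        # pooled: (batch, length) descriptor for one axis of the feature map
        w = self.conv(pooled.unsqueeze(1)).squeeze(1)  # (batch, length)
        return torch.sigmoid(w)


class MultiAxisECA(nn.Module):
    """Applies ECA-style weighting to the channel, frequency, and time axes
    of a (B, C, F, T) spectrogram feature map, using only a few parameters."""

    def __init__(self, kernel_size: int = 3):
        super().__init__()
        self.channel_att = ECA1D(kernel_size)
        self.freq_att = ECA1D(kernel_size)
        self.time_att = ECA1D(kernel_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, f, t = x.shape
        # Parameter-free global average pooling yields one descriptor per axis.
        ch = self.channel_att(x.mean(dim=(2, 3)))  # (B, C)
        fr = self.freq_att(x.mean(dim=(1, 3)))     # (B, F)
        tm = self.time_att(x.mean(dim=(1, 2)))     # (B, T)
        # Broadcast the per-axis weights back onto the feature map.
        return x * ch.view(b, c, 1, 1) * fr.view(b, 1, f, 1) * tm.view(b, 1, 1, t)


if __name__ == "__main__":
    feats = torch.randn(4, 64, 40, 126)  # batch of spectrogram feature maps
    out = MultiAxisECA(kernel_size=3)(feats)
    print(out.shape)  # torch.Size([4, 64, 40, 126])
```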

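The decision-level fusion can be sketched in the same hedged way: each branch (the spectrogram features through the IECA-based network, the IS10 features through a separate classifier) outputs class posteriors, and a small fusion network maps the concatenated branch predictions to the final decision. The single linear fusion layer and the seven-class setup below are assumptions for illustration; the abstract does not specify the fusion network's topology.

```python
# Hypothetical sketch of decision-level fusion: the fusion layer below is an
# assumption, since the abstract does not describe the fusion network itself.
import torch
import torch.nn as nn

NUM_CLASSES = 7  # e.g. the seven EMODB emotion categories


class DecisionFusion(nn.Module):
    """Fuses per-branch class posteriors into a final prediction."""

    def __init__(self, num_classes: int = NUM_CLASSES):
        super().__init__()
        # A single linear layer over the concatenated branch posteriors
        # (assumed topology; the paper's fusion network may differ).
        self.fuse = nn.Linear(2 * num_classes, num_classes)

    def forward(self, p_spec: torch.Tensor, p_is10: torch.Tensor) -> torch.Tensor:
        # p_spec, p_is10: (batch, num_classes) softmax outputs of each branch
        return self.fuse(torch.cat([p_spec, p_is10], dim=1))


if __name__ == "__main__":
    p_spec = torch.softmax(torch.randn(8, NUM_CLASSES), dim=1)  # spectrogram branch
    p_is10 = torch.softmax(torch.randn(8, NUM_CLASSES), dim=1)  # IS10 branch
    print(DecisionFusion()(p_spec, p_is10).shape)  # torch.Size([8, 7])
```
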
Key words: deep learning, Speech Emotion Recognition (SER), attention mechanism, multi-feature fusion, decision-level fusion