Multi-Feature Speech Emotion Recognition Based on Improved Efficient Channel Attention Mechanism

doi:10.19678/j.issn.1000-3428.0069185

Abstract

Attention mechanisms are widely used in Speech Emotion Recognition (SER). However, traditional attention modules improve model performance at the cost of a significantly larger parameter count. The Efficient Channel Attention (ECA) mechanism has few parameters, but it can generate attention weights only for the channel dimension. To address this limitation, an Improved ECA (IECA) module is proposed. The IECA module generates weights for each dimension of the input feature map with a relatively small number of parameters, enabling the model to focus on and exploit the important information in the feature map more effectively. To further improve the recognition rate, spectrogram features and IS10 features are extracted separately from the speech data, and a fusion network combines the predictions of the different branches at the decision level to yield the final prediction. The proposed model achieves a Weighted Accuracy (WA) of 91.63% and 92.46% and an Unweighted Average Recall (UAR) of 91.25% and 92.33% on the EMODB and CASIA datasets, respectively, which are 2.69-8.43 and 4.16-10.69 percentage points higher, respectively, than the results reported in previous research.
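
The two metrics reported above can be illustrated with a small sketch. It assumes the definitions commonly used in SER work: WA is the overall accuracy over all samples, and UAR is the mean of the per-class recalls. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def weighted_accuracy(y_true, y_pred):
    """Overall accuracy: correct predictions divided by all samples."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls, so every class counts equally."""
    totals = Counter(y_true)  # number of samples per true class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)


# Toy, imbalanced labels (illustrative only, not data from the paper).
y_true = ["angry"] * 6 + ["sad"] * 2
y_pred = ["angry"] * 6 + ["sad", "angry"]

wa = weighted_accuracy(y_true, y_pred)           # 7 of 8 samples correct
uar = unweighted_average_recall(y_true, y_pred)  # mean of recalls 1.0 and 0.5
print(f"WA = {wa:.3f}, UAR = {uar:.3f}")
```

On imbalanced data the two values diverge, because a mistake on a small class lowers UAR more than WA. This is one reason both metrics are usually reported together.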

Keywords: deep learning, Speech Emotion Recognition (SER), attention mechanism, multi-feature fusion, decision-level fusion

DU Chenyang, ZHANG Xueying, HUANG Lixia, LI Juan. Multi-Feature Speech Emotion Recognition Based on Improved Efficient Channel Attention Mechanism[J]. Computer Engineering, 2025, 51(4): 97-106.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0069185

https://www.ecice06.com/EN/Y2025/V51/I4/97

Figures/Tables 17

Fig.1 Multi-feature IECA-CTT network structure

Fig.2 ECA module structure

Fig.3 IECA module structure

Fig.4 TCN structure

Fig.5 Weight generation network

Fig.6 Box plots of different attention mechanisms on two datasets

Fig.7 t-SNE visualization of features on EMODB data

Fig.8 Confusion matrix of the model on EMODB

Fig.9 Confusion matrix of the model on CASIA
