
计算机工程 (Computer Engineering)


Audio and Video Emotion Recognition Based on Multiscale Attention and Multi-Expert Coordinated Decision Making

  • Published: 2025-11-21

Abstract: Multimodal emotion recognition aims to understand complex human emotional expression. However, existing methods commonly lack accuracy and robustness when handling the subtle nuances of emotional expression and the complex interactions between modalities. Specifically, traditional speech feature extraction methods struggle to capture emotional information that spans multiple time scales, existing fusion strategies are limited in how efficiently they integrate complementary information and model complex inter-modal associations, and class imbalance and boundary samples often degrade model performance. To address these problems, this paper proposes a new multimodal emotion recognition method for speech and facial images. First, a multi-scale attention mechanism replaces the traditional multilayer perceptron in the speech feature extraction stage; it adaptively focuses on and captures emotional features ranging from micro-level phoneme variations to macro-level prosodic patterns, extracting emotional information more comprehensively. Second, an adaptive multi-expert coordinated decision-making architecture is designed: through per-modality expert networks and an adaptive multimodal expert coordination network, it efficiently integrates the complementary information of different modalities and handles their complex interactions. Finally, a boundary cross-entropy loss function is proposed that combines the strengths of cross-entropy and hinge loss to improve the model's handling of boundary samples and class imbalance. Experiments on the RAVDESS dataset show that the method achieves an accuracy of 89.8%, 3.1 percentage points higher than the baseline model, validating the effectiveness of the proposed improvements.
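
As a concrete illustration of the first component, the sketch below shows one way a multi-scale attention module over a speech feature sequence could look: parallel 1-D convolutions with small and large kernels cover short-range (phoneme-level) to long-range (prosody-level) context, and a learned per-time-step attention fuses the scales. This is a minimal PyTorch-style sketch under assumed dimensions and kernel sizes, not the authors' implementation.

import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    # Hedged sketch: parallel 1-D convolutions with increasing kernel sizes
    # capture short- to long-range context, and a learned attention weighting
    # fuses the scales per time step. Dimensions and kernel sizes are
    # illustrative assumptions, not values from the paper.
    def __init__(self, dim=128, kernel_sizes=(3, 9, 27)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes])
        self.scale_attn = nn.Linear(dim, len(kernel_sizes))

    def forward(self, x):                 # x: (batch, time, dim) speech features
        h = x.transpose(1, 2)             # (batch, dim, time) for Conv1d
        feats = torch.stack(
            [branch(h).transpose(1, 2) for branch in self.branches], dim=2)  # (B, T, scales, dim)
        weights = torch.softmax(self.scale_attn(x), dim=-1)                  # (B, T, scales)
        return (weights.unsqueeze(-1) * feats).sum(dim=2)                    # (B, T, dim)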
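
The adaptive multi-expert coordinated decision-making architecture can likewise be read as gated fusion over per-modality experts. Below is a minimal sketch assuming one expert per modality and a softmax coordination (gating) network over the concatenated audio and visual features; the feature dimensions and the eight-class output (matching the RAVDESS emotion categories) are illustrative assumptions rather than details from the paper.

import torch
import torch.nn as nn

class MultiExpertFusion(nn.Module):
    # Hedged sketch of gated expert fusion: each modality expert produces class
    # logits, and a coordination (gating) network weights the experts per sample.
    def __init__(self, audio_dim=256, visual_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.audio_expert = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.visual_expert = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, audio_feat, visual_feat):
        logits = torch.stack(
            [self.audio_expert(audio_feat), self.visual_expert(visual_feat)], dim=1)  # (B, 2, C)
        weights = torch.softmax(
            self.gate(torch.cat([audio_feat, visual_feat], dim=-1)), dim=-1)          # (B, 2)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)                            # (B, C)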
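
The boundary cross-entropy loss is described only as combining the strengths of cross-entropy and hinge loss. One plausible reading, sketched below, adds a multi-class hinge term that penalises samples whose true-class logit does not exceed the best competing logit by a margin; the margin and the weighting factor lam are hypothetical hyperparameters, and the paper's exact formulation may differ.

import torch.nn.functional as F

def boundary_cross_entropy(logits, targets, margin=1.0, lam=0.5):
    # Hedged sketch of a loss mixing cross-entropy with a multi-class hinge
    # term; `margin` and `lam` are hypothetical hyperparameters.
    ce = F.cross_entropy(logits, targets)          # standard classification term

    # Hinge term: penalise samples whose true-class logit does not exceed the
    # best competing logit by at least `margin` (i.e. boundary samples).
    true_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    competitors = logits.clone()
    competitors.scatter_(1, targets.unsqueeze(1), float("-inf"))
    best_other = competitors.max(dim=1).values
    hinge = F.relu(margin - (true_logit - best_other)).mean()

    return ce + lam * hinge

In a formulation of this kind, the hinge term concentrates gradient on samples near decision boundaries while the cross-entropy term preserves the standard classification objective.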