
计算机工程 (Computer Engineering)


Audio and Video Emotion Recognition Based on Multiscale Attention and Multi-Expert Coordinated Decision Making

  • Published: 2025-11-21

Abstract: Multimodal emotion recognition aims to understand complex human emotional expression. However, existing methods commonly lack accuracy and robustness when handling the subtle nuances of emotional expression and the complex interactions between modalities. Specifically, traditional speech feature extraction methods struggle to capture emotional information that spans multiple time scales, existing fusion strategies are limited in how efficiently they integrate complementary information and model complex inter-modal associations, and class imbalance and boundary samples often degrade model performance. To address these problems, this paper proposes a new multimodal emotion recognition method for speech and facial images. First, a multi-scale attention mechanism replaces the traditional multilayer perceptron in the speech feature extraction stage; it adaptively focuses on and captures emotional features ranging from micro-level phoneme variations to macro-level prosodic patterns, extracting emotional information more comprehensively. Second, an adaptive multi-expert coordinated decision-making architecture is designed: through per-modality expert networks and an adaptive multimodal expert coordination network, it efficiently integrates the complementary information of different modalities and handles their complex interactions. Finally, a boundary cross-entropy loss function is proposed that combines the strengths of cross-entropy and hinge loss to improve the model's handling of boundary samples and class imbalance. Experiments on the RAVDESS dataset show that the method achieves an accuracy of 89.8%, 3.1 percentage points higher than the baseline model, validating the effectiveness of the proposed improvements.
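
As a concrete illustration of the first component, the sketch below shows one way a multi-scale attention module over a speech feature sequence could look: parallel 1-D convolutions with small and large kernels cover short-range (phoneme-level) to long-range (prosody-level) context, and a learned per-time-step attention fuses the scales. This is a minimal PyTorch-style sketch under assumed dimensions and kernel sizes, not the authors' implementation.

import torch
import torch.nn as nn

class MultiScaleAttention(nn.Module):
    # Hedged sketch: parallel 1-D convolutions with increasing kernel sizes
    # capture short- to long-range context, and a learned attention weighting
    # fuses the scales per time step. Dimensions and kernel sizes are
    # illustrative assumptions, not values from the paper.
    def __init__(self, dim=128, kernel_sizes=(3, 9, 27)):
        super().__init__()
        self.branches = nn.ModuleList(
            [nn.Conv1d(dim, dim, k, padding=k // 2) for k in kernel_sizes])
        self.scale_attn = nn.Linear(dim, len(kernel_sizes))

    def forward(self, x):                 # x: (batch, time, dim) speech features
        h = x.transpose(1, 2)             # (batch, dim, time) for Conv1d
        feats = torch.stack(
            [branch(h).transpose(1, 2) for branch in self.branches], dim=2)  # (B, T, scales, dim)
        weights = torch.softmax(self.scale_attn(x), dim=-1)                  # (B, T, scales)
        return (weights.unsqueeze(-1) * feats).sum(dim=2)                    # (B, T, dim)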
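
The adaptive multi-expert coordinated decision-making architecture can likewise be read as gated fusion over per-modality experts. Below is a minimal sketch assuming one expert per modality and a softmax coordination (gating) network over the concatenated audio and visual features; the feature dimensions and the eight-class output (matching the RAVDESS emotion categories) are illustrative assumptions rather than details from the paper.

import torch
import torch.nn as nn

class MultiExpertFusion(nn.Module):
    # Hedged sketch of gated expert fusion: each modality expert produces class
    # logits, and a coordination (gating) network weights the experts per sample.
    def __init__(self, audio_dim=256, visual_dim=256, hidden=128, n_classes=8):
        super().__init__()
        self.audio_expert = nn.Sequential(
            nn.Linear(audio_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.visual_expert = nn.Sequential(
            nn.Linear(visual_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_classes))
        self.gate = nn.Sequential(
            nn.Linear(audio_dim + visual_dim, hidden), nn.ReLU(), nn.Linear(hidden, 2))

    def forward(self, audio_feat, visual_feat):
        logits = torch.stack(
            [self.audio_expert(audio_feat), self.visual_expert(visual_feat)], dim=1)  # (B, 2, C)
        weights = torch.softmax(
            self.gate(torch.cat([audio_feat, visual_feat], dim=-1)), dim=-1)          # (B, 2)
        return (weights.unsqueeze(-1) * logits).sum(dim=1)                            # (B, C)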
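
The boundary cross-entropy loss is described only as combining the strengths of cross-entropy and hinge loss. One plausible reading, sketched below, adds a multi-class hinge term that penalises samples whose true-class logit does not exceed the best competing logit by a margin; the margin and the weighting factor lam are hypothetical hyperparameters, and the paper's exact formulation may differ.

import torch.nn.functional as F

def boundary_cross_entropy(logits, targets, margin=1.0, lam=0.5):
    # Hedged sketch of a loss mixing cross-entropy with a multi-class hinge
    # term; `margin` and `lam` are hypothetical hyperparameters.
    ce = F.cross_entropy(logits, targets)          # standard classification term

    # Hinge term: penalise samples whose true-class logit does not exceed the
    # best competing logit by at least `margin` (i.e. boundary samples).
    true_logit = logits.gather(1, targets.unsqueeze(1)).squeeze(1)
    competitors = logits.clone()
    competitors.scatter_(1, targets.unsqueeze(1), float("-inf"))
    best_other = competitors.max(dim=1).values
    hinge = F.relu(margin - (true_logit - best_other)).mean()

    return ce + lam * hinge

In a formulation of this kind, the hinge term concentrates gradient on samples near decision boundaries while the cross-entropy term preserves the standard classification objective.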