
Computer Engineering

   

Polarity-Aware Multimodal Sentiment Analysis with Multi-Scale Encoding

  

  • Published: 2025-12-15


Abstract: Multimodal sentiment analysis leverages the complementary information of speech, text, and visual modalities to improve the accuracy and robustness of emotion recognition. However, existing approaches still face three major challenges: (1) the lack of unified modeling of multi-scale emotional dynamics across fast and slow temporal rhythms; (2) the difficulty of explicitly characterizing semantic dominance and subordination among modalities; and (3) the limited ability to adaptively regulate modality intensity and information contribution. To address these issues, this paper proposes a multimodal sentiment analysis framework that integrates multi-scale encoding with a polarity-aware fusion mechanism. Specifically, a Multi-Scale Mamba encoder (MS-Mamba) is introduced for the visual and audio modalities to jointly capture global and local temporal dependencies; a Polarity-Aware Fusion (PAF) module is designed to explicitly model inter-modal enhancement and suppression through semantic residuals and signed weights; and a Polarity-Driven Gating (PDG) mechanism is developed to adaptively control information flow via a saliency–direction disentanglement strategy. These components jointly form a closed-loop structure of "temporal modeling–polarity alignment–global gating." Experiments on the CMU-MOSI and CMU-MOSEI datasets show that the proposed model achieves binary classification accuracies of 86.58% and 86.50% and F1 scores of 86.59% and 86.26%, respectively, corresponding to average improvements of approximately 1.33% in accuracy and 1.39% in F1 over mainstream baselines. These results validate the effectiveness and robustness of the proposed method in semantic alignment, temporal modeling, and adaptive fusion.
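
The abstract describes the PAF and PDG components only at a high level, without equations. The following PyTorch sketch is one plausible reading of "signed weights over semantic residuals" and "saliency–direction disentanglement," given purely for illustration; all class, method, and parameter names (PolarityAwareFusion, PolarityDrivenGate, dim, residual_proj, and so on) are hypothetical and are not taken from the paper.

```python
import torch
import torch.nn as nn


class PolarityAwareFusion(nn.Module):
    """Hypothetical sketch: an auxiliary modality (audio or visual) either
    reinforces or suppresses the text representation via a signed scalar
    weight computed from a semantic residual."""

    def __init__(self, dim: int):
        super().__init__()
        self.residual_proj = nn.Linear(dim, dim)
        self.weight_head = nn.Linear(dim, 1)

    def forward(self, text: torch.Tensor, aux: torch.Tensor) -> torch.Tensor:
        # Semantic residual: what the auxiliary modality carries beyond text.
        residual = self.residual_proj(aux - text)
        # tanh keeps the weight signed: positive -> enhancement, negative -> suppression.
        w = torch.tanh(self.weight_head(residual))
        return text + w * residual


class PolarityDrivenGate(nn.Module):
    """Hypothetical saliency-direction disentangled gate: a sigmoid branch
    controls how much information passes (saliency), a tanh branch controls
    its sign (direction)."""

    def __init__(self, dim: int):
        super().__init__()
        self.saliency = nn.Linear(dim, dim)
        self.direction = nn.Linear(dim, dim)

    def forward(self, fused: torch.Tensor) -> torch.Tensor:
        s = torch.sigmoid(self.saliency(fused))  # magnitude in [0, 1]
        d = torch.tanh(self.direction(fused))    # sign in [-1, 1]
        return s * d * fused


# Minimal usage with random features standing in for modality encodings.
paf = PolarityAwareFusion(dim=128)
pdg = PolarityDrivenGate(dim=128)
text_feat, audio_feat = torch.randn(8, 128), torch.randn(8, 128)
out = pdg(paf(text_feat, audio_feat))  # shape: (8, 128)
```

A full implementation would additionally include the MS-Mamba temporal encoders and fuse all three modalities; this sketch only isolates the signed-fusion and gating ideas named in the abstract.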
