
计算机工程 (Computer Engineering) ›› 2023, Vol. 49 ›› Issue (7): 94-101. doi: 10.19678/j.issn.1000-3428.0064965

• Artificial Intelligence and Pattern Recognition •

Multi-modal Emotion Recognition Based on Dynamic Convolution and Residual Gating

Yanxia GUO1,2, Yong JIN1, Hong TANG1,2, Jinzhi PENG1,2   

  1. School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. Chongqing Key Laboratory of Mobile Communications Technology, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received: 2022-06-10  Online: 2023-07-15  Published: 2023-07-14
  • About the authors:

    Yanxia GUO (born 1995), female, M.S. candidate; her main research interest is emotion recognition.

    Yong JIN, senior engineer, M.S.

    Hong TANG, professor, Ph.D.

    Jinzhi PENG, M.S. candidate.

  • Funding:
    Program for Changjiang Scholars and Innovative Research Team in University (IRT_16R72)


Abstract:

To prevent important emotion-related information in an utterance from being overwhelmed by irrelevant information and to achieve multi-modal information interaction, a multi-modal emotion recognition model based on dynamic convolution and residual gating is proposed, which mines high-level local features and applies an effective interaction fusion strategy. Low-level features, high-level local features, and contextual dependencies are first extracted from text, audio, and images. Cross-modal dynamic convolution is then used to model inter-modal and intra-modal interactions, capturing interactions across long temporal sequences and yielding interaction features for the different modalities. Finally, a residual gated fusion method fuses these interaction representations, automatically learning the weight that each representation contributes to the final emotion recognition, and the fused multi-modal features are fed into a classifier for emotion prediction. Experimental results on the CMU-MOSEI and IEMOCAP datasets show that the model keeps important emotion-related information in multi-modal data from being overwhelmed by irrelevant information, achieving emotion classification accuracies of 83.5% and 83.9%, respectively, and outperforming benchmark models such as the Multi-modal Transformer (MulT) and Multi-Fusion Residual Memory (MFRM).
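This page does not carry implementation details for the cross-modal dynamic convolution named in the abstract. The following is a minimal PyTorch-style sketch of one plausible reading, assuming that per-time-step depthwise kernels are predicted from a query modality (e.g., text) and applied to windows of a key modality (e.g., audio); the module name, shapes, and parameters are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalDynamicConv(nn.Module):
    """Depthwise dynamic convolution whose kernels are predicted from another modality.

    For every time step, a small kernel is generated from the "query" modality and
    applied to a local window of the "key" modality, so the interaction pattern can
    change along the sequence instead of being fixed as in a standard convolution.
    """

    def __init__(self, dim: int, kernel_size: int = 3):
        super().__init__()
        self.kernel_size = kernel_size
        # One kernel of length `kernel_size` per channel and per time step.
        self.kernel_proj = nn.Linear(dim, dim * kernel_size)

    def forward(self, query: torch.Tensor, key: torch.Tensor) -> torch.Tensor:
        # query, key: (batch, seq_len, dim); sequences are assumed already length-aligned.
        b, t, d = key.shape
        k = self.kernel_size
        kernels = F.softmax(self.kernel_proj(query).view(b, t, d, k), dim=-1)
        # Unfold the key sequence into sliding windows of length k ("same" padding).
        pad_left = (k - 1) // 2
        key_pad = F.pad(key.transpose(1, 2), (pad_left, k - 1 - pad_left))  # (b, d, t+k-1)
        windows = key_pad.unfold(dimension=2, size=k, step=1)               # (b, d, t, k)
        windows = windows.permute(0, 2, 1, 3)                               # (b, t, d, k)
        # Weighted sum over each window using the dynamically predicted kernels.
        return (kernels * windows).sum(dim=-1)                              # (b, t, d)


if __name__ == "__main__":
    text = torch.randn(2, 20, 64)    # toy text features
    audio = torch.randn(2, 20, 64)   # toy audio features
    out = CrossModalDynamicConv(dim=64)(query=text, key=audio)
    print(out.shape)                 # torch.Size([2, 20, 64])
```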

Key words: natural language processing, information interaction, multi-modal emotion recognition, dynamic convolution, gating mechanism
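Likewise, the residual gated fusion described in the abstract can be pictured as a learned sigmoid gate per interaction representation added onto a residual path from the base features. The sketch below shows one such form under that assumption; `ResidualGatedFusion`, its branch list, and the toy shapes are hypothetical rather than the authors' released design.

```python
import torch
import torch.nn as nn


class ResidualGatedFusion(nn.Module):
    """Fuse several interaction representations with per-branch gates and a residual path.

    Each branch (one cross-modal interaction representation) gets its own sigmoid gate
    computed from the base features and that branch, so the network can learn how much
    each interaction contributes to the final emotion prediction.
    """

    def __init__(self, dim: int, num_branches: int):
        super().__init__()
        self.gates = nn.ModuleList(nn.Linear(2 * dim, dim) for _ in range(num_branches))

    def forward(self, base: torch.Tensor, branches: list) -> torch.Tensor:
        fused = base                                   # residual connection to the base features
        for gate, branch in zip(self.gates, branches):
            g = torch.sigmoid(gate(torch.cat([base, branch], dim=-1)))
            fused = fused + g * branch                 # gate decides how much this branch adds
        return fused


if __name__ == "__main__":
    base = torch.randn(2, 20, 64)                             # e.g. text features
    interactions = [torch.randn(2, 20, 64) for _ in range(2)]  # e.g. text-audio, text-vision
    fused = ResidualGatedFusion(dim=64, num_branches=2)(base, interactions)
    print(fused.shape)                                        # torch.Size([2, 20, 64])
```

The element-wise gates allow the network to down-weight interaction branches that carry little emotional information, which mirrors the abstract's stated goal of keeping emotion-related cues from being overwhelmed by irrelevant content.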