Dimensional Emotion Recognition Method Based on Hierarchical Attention Mechanism

doi:10.19678/j.issn.1000-3428.0054127

Abstract

Abstract: In continuous dimensional emotion recognition, the segments that most strongly express emotion differ within each modality, and the modalities themselves differ in how much they influence the emotional state. To address this, this paper proposes a multimodal dimensional emotion recognition model based on a Hierarchical Attention Mechanism (HAM), which learns the features of each modality and fuses them with a suitable strategy. A frequency attention mechanism is added to the audio modality to learn contextual information in the frequency domain, and a multimodal attention mechanism fuses the video features with the audio features. An improved loss function mitigates the missing-modality problem, which improves both the robustness of the model and its recognition performance. Experimental results on public datasets show that, compared with methods such as Convolutional Neural Networks (CNN) and Long Short-Term Memory (LSTM) networks, the model markedly improves the Concordance Correlation Coefficient (CCC) and recognizes emotions more efficiently, making it suitable for dimensional emotion recognition on large volumes of data.

Key words: multimodality, continuous dimensional emotion recognition, attention mechanism, feature fusion, deep learning
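
The abstract reports its results as CCC values without defining the metric. The following is a minimal sketch, assuming the standard definition of Lin's concordance correlation coefficient, CCC = 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The function name `concordance_cc` and the sample values are illustrative assumptions and are not taken from the paper.

```python
def concordance_cc(pred, gold):
    """Concordance correlation coefficient (CCC) of two equal-length sequences."""
    if len(pred) != len(gold) or len(pred) == 0:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_g = sum(gold) / n
    # Population (divide-by-n) variances and covariance.
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_g = sum((g - mean_g) ** 2 for g in gold) / n
    cov = sum((p - mean_p) * (g - mean_g) for p, g in zip(pred, gold)) / n
    # The squared mean difference penalizes a constant offset,
    # which plain Pearson correlation would ignore.
    return 2.0 * cov / (var_p + var_g + (mean_p - mean_g) ** 2)


# Hypothetical annotation trace and two candidate predictions.
gold = [0.1, 0.4, 0.35, 0.8]
shifted = [g + 0.5 for g in gold]

print(concordance_cc(gold, gold))     # perfect agreement
print(concordance_cc(shifted, gold))  # same shape, constant bias
```

Unlike Pearson correlation, CCC drops below 1 when predictions follow the annotation perfectly but sit at a different level, which is why it is commonly used to score continuous arousal and valence predictions.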

CLC number: TP18

TANG Yuhao, MAO Qirong, GAO Lijian. Dimensional Emotion Recognition Method Based on Hierarchical Attention Mechanism[J]. Computer Engineering, 2020, 46(6): 65-72.

http://www.ecice06.com/CN/Y2020/V46/I6/65

Figures/Tables: 9
