Survey of Research on Multimodal Fusion Technology for Deep Learning

doi:10.19678/j.issn.1000-3428.0057370

Abstract

Multimodal fusion technology (MFT) for deep learning (DL) refers to the process by which a machine acquires information from text, images, speech, video, and other sources, then converts and fuses it to improve model performance. The ubiquity of modalities and the popularity of DL have driven the development of multimodal fusion. Taking the improvement of DL model performance in classification and regression as its starting point, this paper surveys the fusion architectures, fusion methods, and alignment techniques of the early stage of MFT development. It focuses on three fusion architectures (joint, cooperative, and encoder-decoder), analyzing their use in DL and their advantages and disadvantages, and it reviews specific fusion methods and alignment techniques, including multiple kernel learning (MKL), graphical models (GM), and neural networks (NN). On this basis, it summarizes the public datasets commonly used in multimodal fusion research and discusses directions for further research: cross-modal transfer learning, resolution of semantic conflicts between modalities, and evaluation of multimodal combinations.

Key words: deep learning (DL), multimodality, modal fusion, modal alignment, multiple kernel learning (MKL), graphical model (GM)
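The difference between the joint and cooperative architectures named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a joint architecture merges unimodal features into one shared representation (here by plain concatenation), and that a cooperative architecture keeps a separate projection per modality and relates the two spaces through a similarity measure (here cosine similarity). All function names, vectors, and weight matrices are hypothetical.

```python
import math


def joint_fusion(text_vec, image_vec):
    """Joint sketch: merge both modalities into a single shared vector by concatenation."""
    return list(text_vec) + list(image_vec)


def project(weights, vec):
    """Apply a dense linear projection: one output value per weight row."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


def cosine(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cooperative_similarity(text_vec, image_vec, w_text, w_image):
    """Cooperative sketch: each modality has its own projection;
    the projected spaces are compared with a similarity measure."""
    return cosine(project(w_text, text_vec), project(w_image, image_vec))


# Hypothetical toy features and projection weights (assumed for illustration only).
text_vec = [1.0, 0.0]
image_vec = [0.0, 2.0, 0.0]
w_text = [[1.0, 0.0], [0.0, 1.0]]
w_image = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]]

fused = joint_fusion(text_vec, image_vec)
sim = cooperative_similarity(text_vec, image_vec, w_text, w_image)
```

In a trained system the projections would be learned, and the similarity would typically enter a loss that pulls matching text-image pairs together; the sketch only shows where the two architectures differ structurally.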

CLC number: TP391.1

HE Jun, ZHANG Caiqing, LI Xiaozhen, ZHANG Dehai. Survey of Research on Multimodal Fusion Technology for Deep Learning[J]. Computer Engineering, 2020, 46(5): 1-11.

https://www.ecice06.com/CN/Y2020/V46/I5/1

Figures/Tables: 8


