[1] RAMACHANDRAM D,TAYLOR G W.Deep multimodal learning:a survey on recent advances and trends[J].IEEE Signal Processing Magazine,2017,34(6):96-108. [2] BALTRUSAITIS T,AHUJA C,MORENCY L P.Multimodal machine learning:a survey and taxonomy[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2019,41(2):423-443. [3] PENG Yuxin,QI Jinwei.CM-GANs:cross-modal generative adversarial networks for common representation learning[J].Multimedia,2019,15(1):1-13. [4] LEDERER C,ALTSTADT S,ANDRIAMONJE S,et al.A new approach to cross-modal multimedia retrieval[C]//Proceedings of the 18th ACM International Conference on Multimedia.New York,USA:ACM Press,2010:251-260. [5] LIU Yanan,FENG Xiaoqing,ZHOU Zhiguang.Multimodal video classification with stacked contractive autoencoders[J].Signal Processing,2016,120(1):761-766. [6] WU S,BONDUGULA S,LUISIER F.Zeroshot event detection using multi-modal fusion of weakly supervised concepts[C]//Proceedings of 2014 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2014:2665-2672. [7] HABIBIAN A,MENSINK T,SNOEK C G M.Video2vec embeddings recognize events when examples are scarce[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(10):2089-2103. [8] LI Xia,LU Guanming,YAN Jingjie,et al.A review of multimodal dimension emotion prediction[J].Journal of Automation,2018,44(12):2142-2159.(in Chinese)李霞,卢官明,闫静杰,等.多模态维度情感预测综述[J].自动化学报,2018,44(12):2142-2159. [9] XIE Zhibing,GUAN Ling.Multimodal information fusion of audio emotion recognition based on kernel entropy component analysis[J].International Journal of Semantic Computing,2013,7(1):25-42. [10] QI Jinwei,PENG Yuxin,YUAN Yuxin.Cross-modal bidirectional translation via reinforcement learning[C]//Proceedings of the 27th International Joint Conference on Artificial Intelligence.Stockholm,Sweden:[s.n.],2018:2630-2636. [11] LECUN Y,BENGIO Y,HINTON G.Deep learning[J].Nature,2015,521(7553):436-451. [12] JIANG Yugang,WU Zuxuan,WANG Jun,et al.Exploiting feature and class relationships in video categorization with regularized deep neural networks[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2018,40(2):352-364. [13] PORIA S,CAMBRIA E,HOWARD N,et al.Fusing audio,visual and textual clues for sentiment analysis from multimodal content[J].Neurocomputing,2016,174:50-59. [14] ZADEH A,LIANG P P,PORIA S,et al.Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence.Palo Alto,USA:AAAI Press,2018:1-35. [15] FUKUI A,PARK D H,YANG D,et al.Multimodal compact bilinear pooling for visual question answering and visual grounding[C]//Proceedings of Conference on Empirical Methods in Natural Language Processing.Palo Alto,USA:AAAI Press,2016:457-468. [16] LU J S,YANG J W,BATRA D,et al.Hierarchical question-image co-attention for visual question answering[C]//Proceedings of the 30th Conference on Neural Information Processing Systems.Barcelona,Spain:[s.n.],2016:289-297. [17] PANG L,NGO C W.Mutlimodal learning with deep Boltzmann machine for emotion prediction in user generated videos[C]//Proceedings of the 5th Asian Conference on Machine Learning.New York,USA:ACM Press,2015:619-622. [18] HUANG J,KINGSBURY B.Audio-visual deep learning for noise robust speech recognition[C]//Proceedings of the 38th IEEE International Conference on Acoustics,Speech and Signal Processing.Washington D.C.,USA:IEEE Press,2013:7596-7599. [19] WANG Bokun,YANG Yang,XU Xing,et al.Adversarial cross-modal retrieval[C]//Proceedings of 2017 ACM Multimedia Conference.New York,USA:ACM Press,2017:154-162. [20] PENG Yuxin,QI Jinwei,YUAN Yuxi.Modality-specific cross-modal similarity measurement with recurrent attention network[J].IEEE Transactions on Image Processing,2018,27(11):5585-5599. [21] SOCHER R,KARPATHY Q V L A,MANNING C D,et al.Grounded compositional semantics for finding and describing images with sentences[J].Transactions of the Association for Computational Linguistics,2014,2(1):207-218. [22] PAN Yingwei,MEI Tao,YAO Ting,et al.Jointly modeling embedding and translation to bridge video and language[C]//Proceedings of 2016 Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:4594-4602. [23] KIROS R,SALAKHUTDINOV R,RICHARD S Z.Unifying visual-semantic embeddings with multimodal neural language models[J].Computer Science,2014,14(11):2953-2968. [24] HUANG Xin,PENG Yuxin,YUAN Mingkuan.Cross-modal common representation learning by hybrid transfer network[C]//Proceedings of the 26th International Joint Conference on Artificial.Washington D.C.,USA:IEEE Press,2017:1893-1900. [25] LIANG Xiaodan,HU Zhiting,ZHANG Hao,et al.Recurrent topic-transition GAN for visual paragraph generation[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2017:3362-3371. [26] GAO Lianli,GUO Zhaoguo.Video captioning with attention based LSTM and semantic consistency[J].IEEE Transactions on Multimedia,2017,19(9):2045-2055. [27] YANG Yang,ZHOU Jie,AI Jiangbo.Video captioning by adversarial LSTM[J].IEEE Transactions on Image Processing,2018,27(11):5600-5611. [28] ZHANG Han,XU Tao,LI Hongsheng.StackGAN:text to photo-realistic image synthesis with stacked generative adversarial networks[C]//Proceedings of 2017 IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2017:5907-5915. [29] ZADEH A,CHEN M,PORIA S,et al.Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of EMNLP'17.Washington D.C.,USA:IEEE Press,2017:1103-1114. [30] NGIAM J,KHOSLA A,KIM M,et al.Multimodal deep learning[C]//Proceedings of the 28th International Conference on Machine Learning.Washington D.C.,USA:IEEE Press,2011:689-696. [31] SRIVASTAVA N,SALAKHUTV R.Learning representations for multimodal data with deep belief nets[C]//Proceedings of International Conference on Machine Learning.Washington D.C.,USA:IEEE Press,2012:1-8. [32] HE Yonghao,XIANG Shiming,KANG Cuicui,et al.Cross-modal retrieval via deep and bidirectional representation learning[J].IEEE Transactions on Multimedia,2016,18(7):1363-1377. [33] LEDERER C,ALTSTADT S,ANDRIAMONJE S,et al.Jointly modeling deep video and compositional text to bridge vision and language in a unified framework[C]//Proceedings of the 29th AAAI Conference on Artificial Intelligence.Palo Alto,AAAI Press,2015:2346-2352. [34] LIONG V E,LU J W,TAN Y P,et al.Deep coupled metric learning for cross-modal matching[J].IEEE Transactions on Multimedia,2017,19(6):1234-1244. [35] PENG Yuxin,QI Jinwei,HUANG Xin.CCL:cross-modal correlation learning with multigrained fusion by hierarchical network[J].IEEE Transactions on Multimedia,2017,20(2):405-420. [36] MOR N,WOLF L,POLYAK A,et al.A universal music translation network[J].Statistics,2018,2(3):1-14. [37] HUANG X,LIU M Y,BELONGIE S,et al.Multimodal unsupervised image-to-image translation[C]//Proceedings of the 15th European Conference on Computer Vision.Berlin,Germany:Springer,2018:172-189. [38] DMELLO S K,KORY J.A review and meta-analysis of multimodal affect detection systems[J].ACM Computing Surveys,2015,47(3):43-50. [39] ZENG Z,PANTIC M,ROISMAN G I,et al.A survey of affect recognition methods:audio,visual,and spontaneous expressions[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2009,31(1):39-58. [40] CASTELLANO G,KESSOUS L,CARIDAKIS G.Emotion recognition through multiple modalities:face,body gesture,speech[J].Affect and Emotion in Human-Computer Interaction,2008,4868(1):92-103. [41] RAMIREZ G A,BALTRUSAITIS T,MORENCY L P.Modeling latent discriminative dynamic of multi-dimensional affective signals[C]//Proceedings of International Conference on Affective Computing and Intelligent Interaction.Berlin,Germany:Springer,2011:396-406. [42] LAN Z Z,LEI B,YU S I,et al.Multimedia classification and event detection using double fusion[J].Multimedia Tools and Applications,2014,71(1):333-347. [43] BUCAK S S,JIN R,JAIN A K.Multiple kernel learning for visual object recognition:a review[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2014,36(7):1354-1369. [44] JAQUES N,TAYLOR S,SANO A,et al.Multi-kernel learning for estimating individual wellbeing[C]//Proceedings of NIPS Workshop on Multimodal Machine Learning.Montreal,Quebec:[s.n.],2015:1-7. [45] SIKKA K,DYKSTRA K,SATHYANARAYANA S,et al.Multiple kernel learning for emotion recognition in the wild[C]//Proceedings of the 15th ACM International Conference on Multimodal Interaction.New York,USA:ACM Press,2013:517-524. [46] GURBAN M,THIRAN J P,DRUGMAN T,et al.Dynamic modality weighting for multi-stream HMMs in audio-visual speech recognition[C]//Proceedings of the 16th International Conference on Multimodal Interfaces.Istanbul,Turkey:[s.n.],2013:237-240. [47] BALTRUSAITIS T,BANDA N,ROBINSON P.Dimensional affect recognition using continuous conditional random fields[C]//Proceedings of the 10th IEEE International Conference and Workshops on Automatic Face,Gesture Recognition.Washington D.C.,USA:IEEE Press,2013:1-8. [48] JIANG Xinyan,WU Fei,ZHANG Yin,et al.The classification of multi-modal data with hidden conditional random field[J].Pattern Recognition Letters,2015,51(6):63-69. [49] KAHOU S E,BOUTHILLIER X,LAMBLIN P,et al.EmoNets:multimodal deep learning approaches for emotion recognition in video[J].Journal on Multimodal User Interfaces,2016,10(2):99-111. [50] WOLLMER M,METALLINOU A,EYBEN F,et al.Context sensitive multimodal emotion recognition from speech and facial expression using bidirectional LSTM modeling[C]//Proceedings of the 11th Annual Conference of the International Speech Communication Association.Makuhari,Japan:[s.n.],2010:2362-2365. [51] CHEN Shizhe,JIN Qin.Multi-modal dimensional emotion recognition using recurrent neural networks[C]//Proceedings of the 5th International Workshop on Audio/Visual Emotion Challenge.New York,USA:ACM Press,2015:49-56. [52] HINTON G E,SALAKHUTDINOV R R.Reducing the dimensionality of data with neural networks[J].Science,2006,313(5786):504-507. [53] MARTINEZ H P,YANNAKAKIS G N.Deep multimodal fusion[C]//Proceedings of the 16th International Conference on Multimodal Interaction.Istanbul,Turkey:[s.n.],2014:34-41. [54] SIMONYAN K,ZISSERMAN A.Two-stream convolutional networks for action recognition in videos[C]//Proceedings of Advances in Neural Information Processing Systems.Berlin,Germany:Springer,2014:568-576. [55] ROBIN R,MURPH Y.Computer vision and machine learning in science fiction[J].Science Robotics,2019,4(30):7221-7235. [56] KAHOU S E,PAL C,BOUTHILLIER X,et al.Combining modality specific deep neural networks for emotion recognition in video[C]//Proceedings of the 15th ACM on International Conference on Multimodal Interaction.New York,USA:ACM Press,2013:543-550. [57] WU D,PIGOU L,KINDERMANS P J,et al.Deep dynamic neural networks for multimodal gesture segmentation and recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,38(8):1583-1597. [58] GONEN M,ALPAYDN E.Multiple kernel learning algorithms[J].Journal of Machine Learning Research,2011,12(3):2211-2268. [59] YEH Y R,LIN T C,CHUNG Y Y,et al.A novel multiple kernel learning framework for heterogeneous feature fusion and variable selection[J].IEEE Trans-actions on Multimedia,2012,14(3):563-574. [60] MCFEE B,LANCKRIET G R G.Learning multi-modal similarity[J].Journal of Machine Learning Research,2011,12(3):491-523. [61] LIU Fayao,ZHOU Luping,SHEN Chunhua,et al.Multiple kernel learning in the primal for multimodal Alzheimer's disease classification[J].IEEE Journal of Biomedical and Health Informatics,2014,18(3):984-990. [62] SUTTON C,MCCALLUM A.Introduction to conditional random fields for relational learning[M]//GETOOR L,TASKAR B.Introduction to statistical relational learning.Cambridge,USA:MIT Press,2006:93-127. [63] FIDLER S,SHARMA A,URTASUN R.A sentence is worth a thousand pixels holistic CRF model[C]//Proceedings of IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2013:1995-2002. [64] REITER S,SCHULLER B,RIGOLL G.Hidden conditional random fields for meeting segmentation[C]//Proceedings of IEEE International Conference on Multimedia and Expo.Washington D.C.,USA:IEEE Press,2007:639-642. [65] SONG Y,MORENCY L P,DAVIS R.Multimodal human behavior analysis:learning correlation and interaction across modalities[C]//Proceedings of the 14th International Conference on Multimodal Interaction.Washington D.C.,USA:IEEE Press,2012:27-30. [66] SONG Y L,MORENCY L P,DAVIS R.Multi-view latent variable discriminative models for action recognition[C]//Proceedings of the 14th International Conference Multimodal Interaction.Washington D.C.,USA:IEEE Press,2012:2120-2127. [67] GAO Haoyuan,MAO Junhua,ZHOU Jie,et al.Are you talking to a machine dataset and methods for multilingual image question answering[C]//Proceedings of the 29th Annual Conference on Neural Information Processing Systems.Berlin,Germany:Springer,2015:2296-2304. [68] NEVEROVA N,WOLF C,TAYLOR G,et al.ModDrop:adaptive multi-modal gesture recognition[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2016,38(8):1692-1706. [69] JIN Qin,LIANG Junwei.Video description generation using audio and visual cues[C]//Proceedings of the 5th ACM International Conference on Multimedia Retrieval.New York,USA:ACM Press,2016:239-242. [70] VINYALS O,TOSHEV A,BENGIO S,et al.Show and tell:a neural image caption generator[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:3156-3164. [71] RAJAGOPALAN S S,MORENCY L P,BALTRUSAITIS T,et al.Extending long short-term memory for multi-view structured learning[C]//Proceedings of the 14th European Conference on Computer Vision.Berlin,Germany:Springer,2016:338-353. [72] ANDREJ K,LI F F.Deep visual-semantic alignments for generating image descriptions[J].IEEE Transactions on Pattern Analysis and Machine Intelligence,2017,39(4):664-676. [73] TAPASWI M,AUML M B,STIEFELHA R.Aligning plot synopses to videos for story-based retrieval[J].International Journal of Multimedia Information Retrieval,2015,4(1):3-16. [74] TAPASWI M,Auml M B,STIEFELHA R.Book2Movie:aligning video scenes with book chapters[C]//Proceedings of 2015 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2015:1827-1835. [75] TRIGEORGIS G,NICOLAOU M A,ZAFEIRIOU S,et al.Deep canonical time warping[C]//Proceedings of Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:5110-5118. [76] PIOTR B,RÉMI L,EDOUARD G,et al.Weakly-supervised alignment of video with text[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2015:4462-4470. [77] ZHUY K,KIROS R,ZEMEL R,et al.Aligning books and movies:towards story-like visual explanations by watching movies and reading books[C]//Proceedings of IEEE International Conference on Computer Vision.Washington D.C.,USA:IEEE Press,2015:19-27. [78] MAO Junhua,HUANG J,TOSHEV A,et al.Generation and comprehension of unambiguous object descriptions[C]//Proceedings of Computer Vision and Pattern Recognition Conference.Washington D.C.,USA:IEEE Press,2016:11-20. [79] DENGY G,BYRNE W.HMM word and phrase alignment for statistical machine translation[C]//Proceedings of Conference on Human Language Technology and Empirical Methods in Natural.New York,USA:ACM Press,2005:169-176. [80] KELVIN X,JIMMY B,RYAN K,et al.Show,attend and tell:neural image caption generation with visual attention[C]//Proceedings of the 32nd International Conference on Machine Learning.New York,USA:ACM Press,2015:2048-2057. [81] YU Haonan,WANG Jiang,HUANG Zhiheng.Video paragraph captioning using hierarchical recurrent neural networks[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:4584-4593. [82] CHEN C,JAFARI R,KEHTARNAV N.UTD-MHAD:a multimodal data set for human action recognition utilizing a depth camera and a wearable inertial sensor[C]//Proceedings of 2015 IEEE International Conference on Image Processing.Washington D.C.,USA:IEEE Press,2015:168-172. [83] KHAIRE P,IMRAN J,KUMAR P.Human activity recognition by fusion of RGB,depth,and skeletal data[C]//Proceedings of the 2nd International Conference on Computer Vision and Image Processing.Washington D.C.,USA:IEEE Press,2018:409-421. [84] ESCALERA S,BARÓ X,GONZÀLEZ J.ChaLearn looking at people challenge 2014:dataset and results[C]//Proceedings of European Conference on Computer Vision.Berlin,Germany:Springer,2015:459-473. [85] OFLI F,CHAUDHRY R,KURILLO G,et al.Berkeley MHAD:a comprehensive multimodal human action databases[C]//Proceedings of 2013 IEEE Workshop on Applications of Computer Vision.Washington D.C.,USA:IEEE Press,2013:53-60. [86] NG H,VUN T T Y,TONG H L,et al.Action classification on the Berkeley multimodal human action dataset[J].Journal of Engineering and Applied Sciences,2017,12(3):520-526. [87] WANG Mei,DENG Weihong.Deep face recognition:a survey[EB/OL].[2020-01-07].https://arxiv.org/abs/1804.06655. [88] SITOVA Z,SEDENKA J,YANG Q,et al.HMOG:new behavioral biometric features for continuous authentication of smartphone users[J].IEEE Transactions on Information Forensics Security,2016,11(5):877-892. [89] RINGEVAL F,SONDEREGGER A,SAUER J,et al.Introducing the Recola multimodal corpus of remote collaborative and affective interactions[C]//Proceedings of the 10th IEEE International Conference on Automatic Face and Gesture Recognition.Washington D.C.,USA:IEEE Press,2013:1-8. [90] CHENG Yanfen,CHEN Yaoxin,CHEN Yiling,et al.Speech emotion recognition with attention mechanism and hierarchical context[J].Journal of Harbin University of Technology,2019,51(11):100-107.(in Chinese)程艳芬,陈垚鑫,陈逸灵,等.嵌入注意力机制并结合层级上下文的语音情感识别[J].哈尔滨工业大学学报,2019,51(11):100-107. [91] WU Q,TENEY D,WANG P,et al.Visual question answering:a survey of methods and datasets[J].Computer Vision and Image Understanding,2017,163:21-40. [92] MAO Junhua,XU Jiajing,JING Yushi,et al.Training and evaluating multimodal word embeddings with large-scale Web annotated images[C]//Proceedings of Advances in Neural Information Processing Systems.Barcelona,Spain:[s.n.],2016:442-450. [93] SEONG T W,IBRAHIM M Z.A review of audio-visual speech recognition[J].Journal of Telecommunication,Electronic and Computer Engineering,2018,10(1):35-40. [94] GEHRING J,AULI M,GRANGIER D,et al.Convolutional sequence to sequence learning[C]//Proceedings of the 34th International Conference on Machine Learning.New York,USA:ACM Press,2017:1243-1252. [95] MIN R,KOSE N,DUGELAY J L.KinectFaceDB:a Kinect database for face recognition[J].IEEE Transactions on Systems Man and Cybernetics,2014,44(11):1534-1548. [96] WEI Wei,JIA Qingxuan.3D Facial expression recognition based on Kinect[J].International Journal of Innovative Computing Information and Control,2017,13(6):1843-1854. [97] PENG Yuxin,HUANG Xin,QI Jinwei.Cross-media shared representation by hierarchical learning with multiple deep networks[C]//Proceedings of the 25th International Joint Conference on Artificial Intelligence.San Francisco,USA:Morgan Kaufmann Press,2016:3846-3853. |