Computer Engineering, 2020, Vol. 46, Issue 5: 1-11. doi: 10.19678/j.issn.1000-3428.0057370

• Hot Topics and Reviews •

Survey of Research on Multimodal Fusion Technology for Deep Learning

HE Jun1, ZHANG Caiqing2a, LI Xiaozhen1, ZHANG Dehai2b

  1. College of Information Engineering, Kunming University, Kunming 650214, China;
  2. Yunnan University, a. College of Foreign Languages; b. College of Software, Kunming 650206, China
  • Received: 2020-02-11 Revised: 2020-03-13 Published: 2020-03-20
  • About the authors: HE Jun (born 1977), male, associate professor, Ph.D.; main research interests: machine learning, software evolution, and data analysis. ZHANG Caiqing, lecturer, M.S. LI Xiaozhen, lecturer, Ph.D. ZHANG Dehai, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61263043, 61864004); Joint Special Project for Basic Research of Local Undergraduate Universities in Yunnan Province (2017FH001-05).

Abstract: Multimodal Fusion Technology (MFT) for Deep Learning (DL) refers to a machine acquiring information from text, images, speech, video, and other sources, then converting and fusing that information to improve model performance. The ubiquity of modalities and the popularity of DL have driven the rapid development of multimodal fusion. Taking the improvement of DL model performance in classification and regression as its starting point, this paper summarizes the multimodal fusion architectures, fusion methods, and alignment technologies from the early stage of MFT development. It focuses on three fusion architectures (joint, cooperative, and encoder-decoder), analyzing how each is applied in DL along with its advantages and disadvantages. It also examines specific fusion methods and alignment technologies such as Multiple Kernel Learning (MKL), Graphical Models (GM), and Neural Networks (NN). On this basis, the public datasets commonly used in multimodal fusion research are summarized, and directions for further research are outlined, including cross-modal transfer learning, resolution of semantic conflicts between modalities, and evaluation of multimodal combinations.

Key words: Deep Learning (DL), multimodality, modal fusion, modal alignment, Multiple Kernel Learning (MKL), Graphical Model (GM)

CLC number: