Computer Engineering (计算机工程)

• Architecture and Software Technology •

基于虚拟化的多GPU深度神经网络训练框架

杨志刚,吴俊敏,徐恒,尹燕   

  (1.中国科学技术大学 计算机科学与技术学院,合肥 230022;2.中国科学技术大学 a.苏州研究院; b.软件学院,江苏 苏州 215123)
  • 收稿日期:2017-01-09 出版日期:2018-02-15 发布日期:2018-02-15
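  • About the authors: YANG Zhigang (b. 1992), male, M.S., main research interests: parallel computer architecture and GPU parallel acceleration; WU Junmin, associate professor, Ph.D.; XU Heng, M.S.; YIN Yan, Ph.D.
  • Supported by:
    National Key Research and Development Program of China, project "Runtime System for Heterogeneous Fusion Dataflow Accelerators" (2016YFB1000403).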

Training Framework of Multi-GPU Deep Neural Network Based on Virtualization

YANG Zhigang (1), WU Junmin (1, 2a, 2b), XU Heng (2b), YIN Yan (1)

  (1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230022, China; 2a. Suzhou Institute; 2b. School of Software, University of Science and Technology of China, Suzhou, Jiangsu 215123, China)
  • Received:2017-01-09 Online:2018-02-15 Published:2018-02-15

摘要: 针对深度神经网络在分布式多机多GPU上的加速训练问题,提出一种基于虚拟化的远程多GPU调用的实现方法。利用远程GPU调用部署的分布式GPU集群改进传统一对一的虚拟化技术,同时改变深度神经网络在分布式多GPU训练过程中的参数交换的位置,达到两者兼容的目的。该方法利用分布式环境中的远程GPU资源实现深度神经网络的加速训练,且达到单机多GPU和多机多GPU在CUDA编程模式上的统一。以手写数字识别为例,利用通用网络环境中深度神经网络的多机多GPU数据并行的训练进行实验,结果验证了该方法的有效性和可行性。

关键词: 虚拟化, 深度神经网络, 分布式, 多机多GPU, 数据并行, 手写数字识别

Abstract: To accelerate the training of deep neural networks on distributed multi-machine, multi-GPU systems, this paper proposes a virtualization-based method for calling multiple remote GPUs. A distributed GPU cluster deployed through remote GPU calls is used to improve on traditional one-to-one virtualization, and the location at which parameters are exchanged during distributed multi-GPU training is changed so that the two are compatible. The method uses remote GPU resources in a distributed environment to accelerate deep neural network training, and it unifies the CUDA programming model for single-machine multi-GPU and multi-machine multi-GPU training. Taking handwritten digit recognition as an example, experiments on multi-machine, multi-GPU data-parallel training of a deep neural network in a general-purpose network environment verify the effectiveness and feasibility of the method.

Key words: virtualization, deep neural network, distributed, multi-machine and multi-GPU, data parallel, handwritten numeral recognition

CLC number: