Training Framework of Multi-GPU Deep Neural Network Based on Virtualization

Computer Engineering

YANG Zhigang^1, WU Junmin^1,2a,2b, XU Heng^2b, YIN Yan^1

(1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230022, China; 2a. Suzhou Institute; 2b. School of Software, University of Science and Technology of China, Suzhou, Jiangsu 215123, China)

Received: 2017-01-09 Online: 2018-02-15 Published: 2018-02-15
About the authors: YANG Zhigang (born 1992), male, master's degree, main research interests: parallel computer architecture and GPU parallel acceleration; WU Junmin, associate professor, Ph.D.; XU Heng, master's degree; YIN Yan, Ph.D.
Supported by:
National Key Research and Development Program of China, project "Runtime System for Heterogeneous Fusion Dataflow Accelerators" (2016YFB1000403).

Abstract

Abstract: To accelerate deep neural network training on distributed multi-machine, multi-GPU systems, this paper proposes a virtualization-based method for invoking multiple remote GPUs. A distributed GPU cluster deployed through remote GPU calls is used to improve on traditional one-to-one virtualization, and the location at which parameters are exchanged during distributed multi-GPU training is changed so that the two are compatible. The method uses remote GPU resources in a distributed environment to accelerate deep neural network training, and it unifies the CUDA programming model for single-machine multi-GPU and multi-machine multi-GPU configurations. Taking handwritten numeral recognition as an example, experiments on multi-machine, multi-GPU data-parallel training of a deep neural network in a general-purpose network environment verify the effectiveness and feasibility of the method.

Key words: virtualization, deep neural network, distributed, multi-machine and multi-GPU, data parallel, handwritten numeral recognition
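
The abstract refers to data-parallel training with a parameter exchange step, but this page gives no implementation details. The following is only a rough sketch of the general synchronous data-parallel pattern, assuming a toy one-parameter linear model and workers simulated as ordinary function calls. The names `shard`, `local_gradient`, and `train` are hypothetical and are not taken from the paper's framework.

```python
# Hypothetical illustration: the "GPUs" are simulated by plain function calls.
# This is not the paper's framework; it only shows the data-parallel pattern.

def shard(data, num_workers):
    """Split the samples into num_workers contiguous, near-equal shards."""
    size, rem = divmod(len(data), num_workers)
    shards, start = [], 0
    for w in range(num_workers):
        end = start + size + (1 if w < rem else 0)
        shards.append(data[start:end])
        start = end
    return shards


def local_gradient(weight, samples):
    """Summed squared-error gradient for the model y = weight * x on one shard."""
    return sum(2.0 * (weight * x - y) * x for x, y in samples)


def train(data, num_workers, steps=200, lr=0.01):
    """Synchronous data-parallel gradient descent with one exchange point."""
    weight = 0.0
    shards = shard(data, num_workers)
    for _ in range(steps):
        # Each worker computes a gradient on its own shard (sequentially here).
        grads = [local_gradient(weight, s) for s in shards]
        # Parameter exchange: combine the per-worker gradients in one place
        # and normalise by the total number of samples.
        mean_grad = sum(grads) / len(data)
        weight -= lr * mean_grad
    return weight


if __name__ == "__main__":
    samples = [(float(x), 3.0 * x) for x in range(1, 9)]  # true weight is 3.0
    print(train(samples, num_workers=1))
    print(train(samples, num_workers=4))
```

In this sketch the number of workers changes only how the gradient sum is partitioned, so one worker and four workers should reach essentially the same weight. Where the exchange step runs (on a GPU, on the host, or on a remote server) is the kind of design choice the abstract says the method changes.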

CLC number: TP391

YANG Zhigang, WU Junmin, XU Heng, YIN Yan. Training Framework of Multi-GPU Deep Neural Network Based on Virtualization[J]. Computer Engineering.

Editor: SUO Shuzhi

