GPU Parallel Method for Deep Learning Image Classification

doi:10.19678/j.issn.1000-3428.0062607

Abstract

Abstract: To address the low transfer efficiency that arises when deep-learning image classification is parallelized across multiple Graphics Processing Units (GPUs), this paper proposes an improved Ring All Reduce algorithm with lower time complexity. The algorithm pairs nodes at intervals to streamline the data transfer process, alleviating the bandwidth loss of the traditional parameter-server parallel structure. Data parallelism alone struggles to support networks with very large parameter counts, and its speedup tapers off. The backbone of a deep learning network holds fewer weight parameters than the fully connected layers and therefore has a small synchronization overhead, whereas the fully connected layers hold many weights and incur a high gradient transfer overhead. Based on these characteristics, the paper proposes a hybrid GPU parallel optimization algorithm: the backbone network is trained with data parallelism, the fully connected layers are trained with model parallelism, and the improved Ring All Reduce algorithm carries the data communication among nodes. The method is applied to image classification with deep learning models. Experiments on two public datasets, Cifar10 and mini ImageNet, show that the proposed algorithm achieves better acceleration while keeping classification accuracy unchanged, with an improvement of nearly 45% over the data-parallel method.

Keywords: GPU parallelism, Ring All Reduce algorithm, data parallelism, model parallelism, deep learning, image classification
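For readers unfamiliar with the communication primitive named in the abstract, the following is a minimal sketch of the standard (baseline) ring all-reduce, simulated in a single Python process. It is not the paper's improved variant: the abstract does not spell out the interval-pairing rule, so nothing here attempts to reproduce it. The function names `chunk_bounds` and `ring_all_reduce` are hypothetical, and each "worker" is simply a list of floats standing in for one GPU's gradient buffer.

```python
from typing import List, Tuple


def chunk_bounds(length: int, n: int) -> List[Tuple[int, int]]:
    """Split range(length) into n contiguous, near-equal [start, end) chunks."""
    base, extra = divmod(length, n)
    bounds, start = [], 0
    for c in range(n):
        end = start + base + (1 if c < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


def ring_all_reduce(grads: List[List[float]]) -> Tuple[List[List[float]], int]:
    """Simulate the baseline ring all-reduce (sum) over len(grads) workers.

    Returns every worker's final buffer and the number of communication steps.
    """
    n = len(grads)
    data = [list(g) for g in grads]  # work on copies of the local buffers
    if n == 1:
        return data, 0
    bounds = chunk_bounds(len(data[0]), n)
    steps = 0

    # Phase 1, scatter-reduce: at step s, worker i sends chunk (i - s) mod n
    # to its right neighbour, which adds it to its own copy of that chunk.
    for s in range(n - 1):
        outgoing = []
        for i in range(n):  # snapshot first so all sends happen "at once"
            c = (i - s) % n
            lo, hi = bounds[c]
            outgoing.append((c, data[i][lo:hi]))
        for i in range(n):
            c, payload = outgoing[i]
            lo, _ = bounds[c]
            dst = (i + 1) % n
            for k, value in enumerate(payload):
                data[dst][lo + k] += value
        steps += 1

    # Phase 2, all-gather: worker i now owns the fully reduced chunk
    # (i + 1) mod n and circulates it; receivers overwrite instead of adding.
    for s in range(n - 1):
        outgoing = []
        for i in range(n):
            c = (i + 1 - s) % n
            lo, hi = bounds[c]
            outgoing.append((c, data[i][lo:hi]))
        for i in range(n):
            c, payload = outgoing[i]
            lo, hi = bounds[c]
            data[(i + 1) % n][lo:hi] = payload
        steps += 1

    return data, steps


if __name__ == "__main__":
    local = [[1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0], [100.0, 200.0, 300.0, 400.0]]
    reduced, steps = ring_all_reduce(local)
    print(reduced[0], steps)
```

With n workers the baseline takes 2(n - 1) communication steps, and each step moves only about 1/n of the buffer per worker. That step count is what a lower-time-complexity variant would aim to reduce.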

CLC number: TP338.6

HAN Yanling, SHEN Siyang, XU Lijun, WANG Jing, ZHANG Yun, ZHOU Ruyan. GPU Parallel Method for Deep Learning Image Classification[J]. Computer Engineering, 2023, 49(1): 191-200. (in Chinese)

https://www.ecice06.com/CN/Y2023/V49/I1/191

Figures/Tables: 17

