[1] SAK H,SENIOR A,BEAUFAYS F.Long short-term memory based recurrent neural network architectures for large vocabulary speech recognition[EB/OL].[2020-01-04].https://arxiv.org/abs/1402.1128.
[2] SERCU T,PUHRSCH C,KINGSBURY B,et al.Very deep multilingual convolutional neural networks for LVCSR[C]//Proceedings of 2016 IEEE International Conference on Acoustics,Speech and Signal Processing.Washington D.C.,USA:IEEE Press,2016:4955-4959.
[3] KRIZHEVSKY A,SUTSKEVER I,HINTON G E.ImageNet classification with deep convolutional neural networks[C]//Proceedings of International Conference on Neural Information Processing Systems.New York,USA:ACM Press,2012:1097-1105.
[4] HE Kaiming,ZHANG Xiangyu,REN Shaoqing,et al.Deep residual learning for image recognition[C]//Proceedings of 2016 IEEE Conference on Computer Vision and Pattern Recognition.Washington D.C.,USA:IEEE Press,2016:770-778.
[5] MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Proceedings of International Conference on Neural Information Processing Systems.New York,USA:ACM Press,2013:3111-3119.
[6] KINGMA D P,BA J.Adam:a method for stochastic optimization[C]//Proceedings of International Conference on Learning Representations.Washington D.C.,USA:IEEE Press,2015:1-7.
[7] LUO Liangchen,XIONG Yuanhao,LIU Yan,et al.Adaptive gradient methods with dynamic bound of learning rate[C]//Proceedings of International Conference on Learning Representations.Washington D.C.,USA:IEEE Press,2019:15-25.
[8] DEAN J,CORRADO G S,MONGA R,et al.Large scale distributed deep networks[C]//Proceedings of International Conference on Neural Information Processing Systems.New York,USA:ACM Press,2012:1223-1231.
[9] POVEY D,ZHANG X H,KHUDANPUR S.Parallel training of DNNs with natural gradient and parameter averaging[C]//Proceedings of International Conference on Learning Representations.New York,USA:ACM Press,2015:7-18.
[10] NIU F,RECHT B,RE C,et al.HOGWILD!:a lock-free approach to parallelizing stochastic gradient descent[C]//Proceedings of the 25th Conference on Neural Information Processing Systems.New York,USA:ACM Press,2011:693-701.
[11] DAI Wei,ZHOU Yi,DONG Nanqing,et al.Toward understanding the impact of staleness in distributed machine learning[C]//Proceedings of International Conference on Learning Representations.New York,USA:ACM Press,2019:1-8.
[12] ZHENG Shuxin,MENG Qi,WANG Taifeng,et al.Asynchronous stochastic gradient descent with delay compensation[C]//Proceedings of International Conference on Machine Learning.New York,USA:ACM Press,2017:28-45.
[13] LI S Z,MADDAH-ALI M A,YU Q,et al.A fundamental tradeoff between computation and communication in distributed computing[J].IEEE Transactions on Information Theory,2018,64(1):109-128.
[14] LI S Z,MADDAH-ALI M A.Compressed coded distributed computing[C]//Proceedings of 2018 IEEE International Symposium on Information Theory.Washington D.C.,USA:IEEE Press,2018:2032-2036.
[15] FERDINAND N,AL-LAWATI H,DRAPER S,et al.Anytime minibatch:exploiting stragglers in online distributed optimization[EB/OL].[2020-01-04].https://arxiv.org/abs/2006.05752.
[16] FERDINAND N,GHARACHORLOO B,DRAPER S C.Anytime exploitation of stragglers in synchronous stochastic gradient descent[C]//Proceedings of the 16th IEEE International Conference on Machine Learning and Applications.Washington D.C.,USA:IEEE Press,2017:141-146.
[17] YU Q,MADDAH-ALI M A,AVESTIMEHR A S.Straggler mitigation in distributed matrix multiplication:fundamental limits and optimal coding[C]//Proceedings of IEEE International Symposium on Information Theory.Washington D.C.,USA:IEEE Press,2018:2157-2162.
[18] YU Q,MADDAH-ALI M A,AVESTIMEHR A S.Polynomial codes:an optimal design for high-dimensional coded matrix multiplication[C]//Proceedings of the 31st Conference on Neural Information Processing Systems.New York,USA:ACM Press,2017:4406-4416.
[19] SEIDE F,FU H,DROPPO J,et al.1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs[C]//Proceedings of the 15th Annual Conference of the International Speech Communication Association.Singapore:ISCA,2014:1058-1062.
[20] CHEN Kai,HUO Qiang.Scalable training of deep learning machines by incremental block training with intra-block parallel optimization and blockwise model-update filtering[C]//Proceedings of 2016 IEEE International Conference on Acoustics,Speech and Signal Processing.Washington D.C.,USA:IEEE Press,2016:2379-2384.
[21] ASSRAN M,LOIZOU N,BALLAS N,et al.Stochastic gradient push for distributed deep learning[EB/OL].[2020-01-04].https://arxiv.org/abs/1811.10792.
[22] LEE K,LAM M,PEDARSANI R,et al.Speeding up distributed machine learning using codes[J].IEEE Transactions on Information Theory,2018,64(3):1514-1529.