
Computer Engineering ›› 2023, Vol. 49 ›› Issue (1): 191-200. doi: 10.19678/j.issn.1000-3428.0062607

• Architecture and Software Technology •

Research on GPU Parallel Methods for Deep Learning Image Classification

HAN Yanling, SHEN Siyang, XU Lijun, WANG Jing, ZHANG Yun, ZHOU Ruyan

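  1. College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
  • Received: 2021-09-06 Revised: 2022-02-14 Published: 2022-03-22
  • About the authors: HAN Yanling (born 1975), female, professor, Ph.D., main research interests: big data technology and hyperspectral remote sensing image classification; SHEN Siyang, master's student; XU Lijun (corresponding author), lecturer; WANG Jing, lecturer, Ph.D.; ZHANG Yun, professor, Ph.D.; ZHOU Ruyan, associate professor, Ph.D.
  • Funding:
    National Key Research and Development Program of China, "Blue Granary Science and Technology Innovation" key special project (2019YFD0900805); National Natural Science Foundation of China (42176175).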

GPU Parallel Method for Deep Learning Image Classification

HAN Yanling, SHEN Siyang, XU Lijun, WANG Jing, ZHANG Yun, ZHOU Ruyan   

  1. College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
  • Received: 2021-09-06 Revised: 2022-02-14 Published: 2022-03-22

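Abstract: To address the low transmission efficiency that arises after multi-GPU parallelization in deep learning image classification, an improved Ring All Reduce algorithm with low time complexity is proposed. The data transmission process is optimized by pairing nodes at intervals, which alleviates the bandwidth loss of the traditional parameter-server parallel structure. Data parallelism alone struggles to support large-scale network parameters, and its speedup tails off. The backbone network of a deep learning model contains fewer weight parameters than the fully connected layers and therefore has a small synchronization overhead, whereas the fully connected layers have large weights and an excessive gradient transmission overhead. Based on these characteristics, a GPU hybrid parallel optimization algorithm is proposed: the backbone network is trained with data parallelism, the fully connected layers with model parallelism, and the improved Ring All Reduce algorithm carries the data communication between nodes after parallelization, for image classification based on deep learning models. Experimental results on two public datasets, Cifar10 and mini ImageNet, show that the algorithm achieves better acceleration while keeping classification accuracy unchanged, with an improvement of nearly 45% over the data-parallel method.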

Keywords: GPU parallel, Ring All Reduce algorithm, data parallelism, model parallelism, deep learning, image classification
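For readers unfamiliar with the baseline, the following is a minimal Python sketch of the standard ring all-reduce that the improved algorithm in the abstract starts from. It simulates the nodes as plain lists inside one process. The function names `ring_all_reduce` and `ring_steps` are hypothetical, and the interval-pairing refinement is not reproduced here because the abstract does not specify it.

```python
def ring_all_reduce(node_data):
    """Simulate a standard ring all-reduce (sum) over equal-length vectors.

    node_data[r] is the local vector held by node r. Returns the buffers of
    all nodes after the operation; every buffer holds the element-wise sum.
    """
    n = len(node_data)
    length = len(node_data[0])
    # Split each vector into n contiguous chunks (some may be empty).
    bounds = [(i * length) // n for i in range(n + 1)]
    bufs = [list(v) for v in node_data]

    def chunk(c):
        return range(bounds[c], bounds[c + 1])

    def exchange(step, offset, combine):
        # Snapshot all outgoing chunks first so that every node sends
        # "simultaneously" within one step, as on real hardware.
        sends = []
        for r in range(n):
            c = (r + offset - step) % n
            sends.append((c, [bufs[r][i] for i in chunk(c)]))
        for r, (c, payload) in enumerate(sends):
            dst = (r + 1) % n  # each node only talks to its right neighbour
            for i, v in zip(chunk(c), payload):
                bufs[dst][i] = combine(bufs[dst][i], v)

    # Phase 1, scatter-reduce: after n - 1 steps node r holds the fully
    # reduced chunk (r + 1) mod n.
    for step in range(n - 1):
        exchange(step, 0, lambda old, new: old + new)

    # Phase 2, all-gather: the reduced chunks travel once around the ring
    # and overwrite the stale partial sums.
    for step in range(n - 1):
        exchange(step, 1, lambda old, new: new)

    return bufs


def ring_steps(n):
    """Number of communication steps of the standard ring for n nodes."""
    return 2 * (n - 1)


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5, 6],
            [10, 20, 30, 40, 50, 60],
            [100, 200, 300, 400, 500, 600]]
    print(ring_all_reduce(data)[0], ring_steps(len(data)))
```

With N nodes the standard ring needs 2(N - 1) communication steps, and each node sends roughly 2(N - 1)/N times the vector length without routing traffic through a central parameter server. The step count growing linearly with N is the cost that a lower-time-complexity variant would aim to reduce.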

Abstract: This paper proposes an improved Ring All Reduce algorithm with lower time complexity to address the low transmission efficiency of multi-GPU (Graphics Processing Unit) parallelism in deep learning image classification. The algorithm optimizes the data transmission process by pairing nodes at intervals, which alleviates the bandwidth loss of the traditional parameter-server parallel structure. In addition, because data parallelism struggles to support large-scale network parameters and its speedup tails off, the paper exploits two characteristics of deep learning models: the backbone network has fewer weight parameters than the fully connected layers and therefore a small synchronization overhead, whereas the fully connected layers have large weights and a very high gradient transmission overhead. On this basis, a GPU hybrid parallel optimization algorithm is proposed in which the backbone network uses data parallelism and the fully connected layers use model parallelism. Data communication between nodes after parallelization is handled by the improved Ring All Reduce algorithm, and the scheme is applied to image classification training. Experiments on two widely used datasets, Cifar10 and mini ImageNet, show that the proposed algorithm achieves better acceleration while maintaining classification accuracy, with an improvement of nearly 45% over data parallelism.

Key words: GPU parallel, Ring All Reduce algorithm, data parallelism, model parallelism, deep learning, image classification
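The motivation for the hybrid scheme can be illustrated with a back-of-the-envelope parameter count. The sketch below assumes a VGG16-style network on 224x224 RGB inputs with 1000 classes, which is an illustrative choice and not a model named in the abstract. It counts only the gradient volume each GPU would send in a standard ring all-reduce. It does not model the activation exchange that model-parallel fully connected layers require, and it makes no attempt to reproduce the reported speedup. All function names are hypothetical.

```python
# Assumed architecture for illustration only: VGG16-style convolutional
# backbone ("M" marks a max-pooling layer) followed by three FC layers.
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
FC_SIZES = [512 * 7 * 7, 4096, 4096, 1000]


def conv_params(cfg, in_channels=3, kernel=3):
    """Weights plus biases of a stack of kernel x kernel convolutions."""
    total = 0
    for out_channels in cfg:
        if out_channels == "M":  # pooling layers have no weights
            continue
        total += in_channels * kernel * kernel * out_channels + out_channels
        in_channels = out_channels
    return total


def fc_params(sizes):
    """Weights plus biases of consecutive fully connected layers."""
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))


def ring_traffic_per_node(num_params, num_gpus):
    """Values each GPU sends in one standard ring all-reduce: 2(N-1)/N * P."""
    return 2 * (num_gpus - 1) * num_params / num_gpus


backbone = conv_params(VGG16_CONV)
fc = fc_params(FC_SIZES)
n_gpus = 4  # assumed cluster size for the example

# Pure data parallelism synchronises every gradient; the hybrid scheme
# all-reduces only the backbone gradients and keeps FC weights sharded.
data_parallel = ring_traffic_per_node(backbone + fc, n_gpus)
hybrid_backbone_only = ring_traffic_per_node(backbone, n_gpus)

if __name__ == "__main__":
    print(f"backbone parameters: {backbone:,}")
    print(f"FC parameters:       {fc:,}")
    print(f"FC share of weights: {fc / (backbone + fc):.1%}")
    print(f"gradient values sent per GPU, data parallel: {data_parallel:,.0f}")
    print(f"gradient values sent per GPU, backbone only: {hybrid_backbone_only:,.0f}")
```

Under these assumptions the fully connected layers hold the large majority of the weights, so keeping them sharded removes most of the gradient synchronization volume. How much of that saving survives in practice depends on the activation traffic of the model-parallel layers, which this sketch leaves out.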

CLC number: