
Computer Engineering ›› 2020, Vol. 46 ›› Issue (1): 201-207. doi: 10.19678/j.issn.1000-3428.0054014

• Architecture and Software Technology •

Design and Implementation of a Lightweight Distributed Machine Learning System

SONG Kuangshi1,2, LI Chong1, ZHANG Shibo1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. College of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2019-02-26  Revised: 2019-04-11  Online: 2020-01-15  Published: 2019-05-22
  • About the authors: SONG Kuangshi (b. 1994), male, M.S. candidate; his main research interest is distributed systems. LI Chong, associate researcher; ZHANG Shibo, senior engineer.
  • Funding:
    Chinese Academy of Sciences "13th Five-Year" Informatization Major Project, "CAS Scientific Research and Education Situational Awareness Service" (XXH13504-03).

Abstract: To meet the demands of large-scale machine learning systems for high customizability, low coupling, and low resource consumption, this paper designs and implements a lightweight distributed machine learning system. The system adopts a modular, layered design and ports a variety of mainstream machine learning and deep learning algorithms. Two extensible gradient synchronization schemes, Parameter Server and dynamic Ring-AllReduce, are proposed to accelerate parallel training of the algorithm models. Experimental results show that the system offers good scalability and stability for both sparse and dense models: Parameter Server training reaches accuracy and convergence close to those of single-machine training, and Ring-AllReduce achieves a 6x training speedup on 8 nodes relative to a single node.

Key words: machine learning system, distributed system, parallel computing, collective communication, modularity
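The Ring-AllReduce scheme named in the abstract can be illustrated with a minimal single-process sketch. This is a toy simulation under stated assumptions, not the paper's actual code: the function name `ring_allreduce`, the list-of-lists layout, and the chunking are all illustrative. Each of the P nodes' gradients is split into P chunks; P-1 scatter-reduce steps sum the chunks around the ring, and P-1 all-gather steps circulate the completed sums, so per-node communication volume stays roughly constant as P grows.

```python
# Toy single-process simulation of Ring-AllReduce gradient averaging.
# Each "node" is a list of floats; a send is modeled as an in-place
# write to the ring neighbor. Illustrative only.

def ring_allreduce(grads):
    """Average the gradient vectors of all nodes in place."""
    p = len(grads)          # number of nodes in the ring
    n = len(grads[0])       # gradient length (same on every node)
    size = n // p
    # Chunk i covers bounds[i]; the last chunk absorbs any remainder.
    bounds = [(i * size, (i + 1) * size if i < p - 1 else n) for i in range(p)]

    # Phase 1: scatter-reduce. After p-1 steps, node i holds the fully
    # summed chunk (i+1) % p.
    for step in range(p - 1):
        for i in range(p):
            lo, hi = bounds[(i - step) % p]   # chunk node i forwards now
            dst = grads[(i + 1) % p]          # its ring neighbor
            for j in range(lo, hi):
                dst[j] += grads[i][j]

    # Phase 2: all-gather. Circulate the completed chunks so every node
    # ends up with the full sum.
    for step in range(p - 1):
        for i in range(p):
            lo, hi = bounds[(i + 1 - step) % p]
            dst = grads[(i + 1) % p]
            for j in range(lo, hi):
                dst[j] = grads[i][j]

    # Divide by p so the result is the average gradient.
    for g in grads:
        for j in range(n):
            g[j] /= p

grads = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
ring_allreduce(grads)
print(grads[0])   # [5.0, 6.0, 7.0, 8.0] -- the element-wise mean
```

In each step a node exchanges only one chunk (about 1/P of the gradient) with its neighbor, which is the bandwidth-optimality argument usually made for this algorithm.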

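The abstract's other scheme, Parameter Server training, can likewise be sketched in a few lines. The names below (`ParameterServer`, `push`, `pull`, `sync_round`) are hypothetical illustrations of the push/pull pattern, not the system's real interface: workers compute local gradients, the server aggregates them and applies one SGD step, and workers pull back the refreshed parameters.

```python
# Toy synchronous Parameter Server round. Class and method names are
# hypothetical; they only illustrate the push/pull pattern.

class ParameterServer:
    """Holds the global model and applies SGD steps to pushed gradients."""

    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr

    def push(self, grad):
        # Apply one SGD step with the aggregated gradient.
        for j, g in enumerate(grad):
            self.params[j] -= self.lr * g

    def pull(self):
        # Workers fetch the latest global parameters.
        return list(self.params)

def sync_round(server, worker_grads):
    """One synchronous round: average worker gradients, push once, pull."""
    p = len(worker_grads)
    avg = [sum(gs) / p for gs in zip(*worker_grads)]
    server.push(avg)
    return server.pull()

server = ParameterServer([10.0, 20.0], lr=0.5)
new_params = sync_round(server, [[2.0, 4.0], [6.0, 8.0]])
print(new_params)   # [8.0, 17.0]
```

Averaging before the push makes the update equivalent to a single-machine SGD step over the combined mini-batch, which is one way to read the abstract's claim that Parameter Server training converges comparably to standalone training; a sparse model would push only the nonzero gradient entries.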