
Computer Engineering ›› 2020, Vol. 46 ›› Issue (1): 201-207. doi: 10.19678/j.issn.1000-3428.0054014

• Architecture and Software Technology •

Design and Implementation of a Lightweight Distributed Machine Learning System

SONG Kuangshi1,2, LI Chong1, ZHANG Shibo1

  1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China;
    2. College of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2019-02-26  Revised: 2019-04-11  Online: 2020-01-15  Published: 2019-05-22
  • About the authors: SONG Kuangshi (b. 1994), male, M.S. candidate; his main research interest is distributed systems. LI Chong, associate researcher; ZHANG Shibo, senior engineer.
  • Funding:
    Chinese Academy of Sciences "13th Five-Year" Informatization Major Project, "CAS Scientific Research and Education Situational Awareness Service" (XXH13504-03).

Abstract: To meet the demands of large-scale machine learning systems for high customizability, low coupling, and low resource consumption, this paper designs and implements a lightweight distributed machine learning system. The system adopts a modular, layered design and ports a variety of mainstream machine learning and deep learning algorithms. Two extensible gradient synchronization schemes, Parameter Server and dynamic Ring-AllReduce, are proposed to accelerate parallel training of the algorithm models. Experimental results show that the system offers good scalability and stability for both sparse and dense models: Parameter Server training reaches accuracy and convergence close to those of single-machine training, and Ring-AllReduce achieves a 6x training speedup on 8 nodes relative to a single node.

Key words: machine learning system, distributed system, parallel computing, collective communication, modularity
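The Ring-AllReduce scheme named in the abstract can be illustrated with a minimal single-process sketch. This is a toy simulation under stated assumptions, not the paper's actual code: the function name `ring_allreduce`, the list-of-lists layout, and the chunking are all illustrative. Each of the P nodes' gradients is split into P chunks; P-1 scatter-reduce steps sum the chunks around the ring, and P-1 all-gather steps circulate the completed sums, so per-node communication volume stays roughly constant as P grows.

```python
# Toy single-process simulation of Ring-AllReduce gradient averaging.
# Each "node" is a list of floats; a send is modeled as an in-place
# write to the ring neighbor. Illustrative only.

def ring_allreduce(grads):
    """Average the gradient vectors of all nodes in place."""
    p = len(grads)          # number of nodes in the ring
    n = len(grads[0])       # gradient length (same on every node)
    size = n // p
    # Chunk i covers bounds[i]; the last chunk absorbs any remainder.
    bounds = [(i * size, (i + 1) * size if i < p - 1 else n) for i in range(p)]

    # Phase 1: scatter-reduce. After p-1 steps, node i holds the fully
    # summed chunk (i+1) % p.
    for step in range(p - 1):
        for i in range(p):
            lo, hi = bounds[(i - step) % p]   # chunk node i forwards now
            dst = grads[(i + 1) % p]          # its ring neighbor
            for j in range(lo, hi):
                dst[j] += grads[i][j]

    # Phase 2: all-gather. Circulate the completed chunks so every node
    # ends up with the full sum.
    for step in range(p - 1):
        for i in range(p):
            lo, hi = bounds[(i + 1 - step) % p]
            dst = grads[(i + 1) % p]
            for j in range(lo, hi):
                dst[j] = grads[i][j]

    # Divide by p so the result is the average gradient.
    for g in grads:
        for j in range(n):
            g[j] /= p

grads = [[1.0, 2.0, 3.0, 4.0],
         [5.0, 6.0, 7.0, 8.0],
         [9.0, 10.0, 11.0, 12.0]]
ring_allreduce(grads)
print(grads[0])   # [5.0, 6.0, 7.0, 8.0] -- the element-wise mean
```

In each step a node exchanges only one chunk (about 1/P of the gradient) with its neighbor, which is the bandwidth-optimality argument usually made for this algorithm.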

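The abstract's other scheme, Parameter Server training, can likewise be sketched in a few lines. The names below (`ParameterServer`, `push`, `pull`, `sync_round`) are hypothetical illustrations of the push/pull pattern, not the system's real interface: workers compute local gradients, the server aggregates them and applies one SGD step, and workers pull back the refreshed parameters.

```python
# Toy synchronous Parameter Server round. Class and method names are
# hypothetical; they only illustrate the push/pull pattern.

class ParameterServer:
    """Holds the global model and applies SGD steps to pushed gradients."""

    def __init__(self, params, lr=0.5):
        self.params = list(params)
        self.lr = lr

    def push(self, grad):
        # Apply one SGD step with the aggregated gradient.
        for j, g in enumerate(grad):
            self.params[j] -= self.lr * g

    def pull(self):
        # Workers fetch the latest global parameters.
        return list(self.params)

def sync_round(server, worker_grads):
    """One synchronous round: average worker gradients, push once, pull."""
    p = len(worker_grads)
    avg = [sum(gs) / p for gs in zip(*worker_grads)]
    server.push(avg)
    return server.pull()

server = ParameterServer([10.0, 20.0], lr=0.5)
new_params = sync_round(server, [[2.0, 4.0], [6.0, 8.0]])
print(new_params)   # [8.0, 17.0]
```

Averaging before the push makes the update equivalent to a single-machine SGD step over the combined mini-batch, which is one way to read the abstract's claim that Parameter Server training converges comparably to standalone training; a sparse model would push only the nonzero gradient entries.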