
Computer Engineering (计算机工程) ›› 2024, Vol. 50 ›› Issue (2): 25-32. doi: 10.19678/j.issn.1000-3428.0067396

• Hot Topics and Reviews •

Research on Heterogeneous Computing Scheduling Strategy for Kubeflow

Yi SUN*(), Huimei WANG, Ming XIAN, Hang XIANG   

  1. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410000, Hunan, China
  • Received: 2023-04-13 Online: 2024-02-15 Published: 2024-02-21
  • Corresponding author: Yi SUN
  • Funding:
    National Ministries Foundation

Research on Heterogeneous Computing Scheduling Strategy for Kubeflow

Yi SUN*(), Huimei WANG, Ming XIAN, Hang XIANG   

  1. College of Electronic Science and Technology, National University of Defense Technology, Changsha 410000, Hunan, China
  • Received:2023-04-13 Online:2024-02-15 Published:2024-02-21
  • Contact: Yi SUN

Abstract:

Kubeflow combines the two fields of machine learning and cloud computing, integrating a large number of machine learning tools and providing a feasible path to production-grade machine learning platforms. Machine learning typically relies on dedicated processors such as Graphics Processing Units (GPUs) to accelerate training and inference. As cloud computing clusters are resized dynamically, nodes with different computing architectures can flexibly join or leave the cluster, and the traditional round-robin scheduling policy can no longer handle heterogeneous computing-power scheduling under such dynamic adjustment. To optimize the allocation of heterogeneous computing power on the Kubeflow platform, improve platform resource utilization, and achieve load balancing, a cloud-based CPU-GPU heterogeneous computing-power scheduling strategy is proposed. It adopts two quantified indicators, load-balance degree and priority, allocates GPU memory at fine granularity, and mounts computing resources to the corresponding Pods to achieve fine-grained scheduling of computing-power resources. A resource weight matrix is designed from the computing resources of each cluster node, and an improved genetic algorithm derives the optimal Pod deployment plan, guaranteeing the execution of multiple tasks. Experimental results show that the strategy supports parallel tasks well; when resource requests overflow, it schedules execution by priority and achieves optimal load. Compared with the platform's native policy, resource granularity is refined by an order of magnitude, and cluster load balancing is also noticeably improved.
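The two judgment indicators mentioned above can be illustrated with a small sketch. The metric definition (1 minus the standard deviation of per-node utilization) and the task fields are illustrative assumptions for exposition, not the paper's actual formulas:

```python
import statistics

def balance_degree(utilizations):
    """Quantified load-balance degree: 1 minus the population standard
    deviation of per-node utilization; a perfectly even cluster scores 1."""
    return 1.0 - statistics.pstdev(utilizations)

def priority_order(tasks):
    """When resource requests overflow cluster capacity, dispatch tasks
    in descending priority order."""
    return sorted(tasks, key=lambda t: t["priority"], reverse=True)

if __name__ == "__main__":
    # An evenly loaded cluster scores higher than a skewed one.
    print(balance_degree([0.6, 0.6, 0.6]))  # 1.0
    print(balance_degree([0.1, 0.9, 0.8]))

    tasks = [{"name": "train-a", "priority": 2},
             {"name": "infer-b", "priority": 5},
             {"name": "train-c", "priority": 1}]
    print([t["name"] for t in priority_order(tasks)])  # infer-b first
```

A scheduler can then compare candidate placements by their balance degree and, under contention, admit Pods in the order returned by the priority sort.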

Keywords: cloud computing, machine learning, heterogeneous computing power, resource scheduling, genetic algorithm

Abstract:

Kubeflow is a project that combines machine learning with cloud computing technology, integrating a large number of machine learning tools and providing a feasible solution for deploying production-grade machine learning platforms. Machine learning typically relies on dedicated processors such as Graphics Processing Units (GPUs) to improve training and inference speed. As the size of a cloud computing cluster is adjusted dynamically, computing nodes of different architectures can flexibly join or leave the cluster, and the traditional round-robin scheduling strategy can no longer schedule heterogeneous computing resources under such dynamic adjustment. To solve the allocation and optimization problem of heterogeneous computing power on the Kubeflow platform, improve platform resource utilization, and achieve load balancing, a cloud-based Central Processing Unit-GPU (CPU-GPU) heterogeneous computing power scheduling strategy is proposed. The strategy adopts two quantified indicators, load-balancing degree and priority, and allocates GPU memory at fine granularity, mounting computing resources to the corresponding Pods to achieve fine-grained scheduling of computing resources. A resource weight matrix is designed according to the computing resources of each node in the cluster, and an improved genetic algorithm is used to obtain the optimal Pod deployment scheme, guaranteeing the execution of multiple tasks. Experimental results show that the proposed strategy supports parallel tasks well and, when resource requests overflow, schedules execution by priority while achieving optimal load. Compared with the platform-native strategy, resource granularity is refined by an order of magnitude, and cluster load balancing is also significantly improved.
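As a rough illustration of the genetic-algorithm placement step, the following self-contained sketch evolves Pod-to-node assignments toward an even cluster load. The node capacities, resource vectors, fitness definition, and GA parameters are all illustrative assumptions, not the paper's actual resource weight matrix or operator design:

```python
import random

# Hypothetical cluster: per-node available (CPU cores, GPU memory in MiB).
NODES = {
    "node-a": (16.0, 16384.0),
    "node-b": (8.0, 8192.0),
    "node-c": (32.0, 0.0),  # CPU-only node, no GPU memory
}
NODE_NAMES = list(NODES)

# Pending Pods: (CPU request, GPU-memory request).
PODS = [(2.0, 2048.0), (4.0, 4096.0), (1.0, 0.0), (2.0, 1024.0)]

def node_fractions(assignment):
    """Fractional utilization of each node after placing Pods per `assignment`."""
    cpu = {n: 0.0 for n in NODES}
    gpu = {n: 0.0 for n in NODES}
    for (c, g), node in zip(PODS, assignment):
        cpu[node] += c
        gpu[node] += g
    fracs = []
    for n, (cap_cpu, cap_gpu) in NODES.items():
        f_cpu = cpu[n] / cap_cpu if cap_cpu else (float("inf") if cpu[n] else 0.0)
        f_gpu = gpu[n] / cap_gpu if cap_gpu else (float("inf") if gpu[n] else 0.0)
        fracs.append(max(f_cpu, f_gpu))
    return fracs

def fitness(assignment):
    """Lower variance of node utilization = better balance; overload is penalized."""
    fracs = node_fractions(assignment)
    if any(f > 1.0 for f in fracs):  # infeasible: a request exceeds capacity
        return -1e9
    mean = sum(fracs) / len(fracs)
    return -sum((f - mean) ** 2 for f in fracs) / len(fracs)

def evolve(pop_size=40, generations=100, p_mut=0.1):
    """Simple elitist GA over Pod-to-node assignment strings."""
    pop = [[random.choice(NODE_NAMES) for _ in PODS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]          # keep the fitter half
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = random.sample(survivors, 2)
            cut = random.randrange(1, len(PODS))  # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < p_mut:           # mutation: reassign one Pod
                child[random.randrange(len(PODS))] = random.choice(NODE_NAMES)
            children.append(child)
        pop = survivors + children
    return max(pop, key=fitness)

if __name__ == "__main__":
    best = evolve()
    print(list(zip(PODS, best)), fitness(best))
```

In this toy setup the chromosome is simply the list of node names, one per Pod; a production scheduler would additionally encode priorities and the per-resource weights described in the abstract.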

Key words: cloud computing, machine learning, heterogeneous computing, resource scheduling, genetic algorithm