Analysis and Optimization of UCX Based on "Songshan" Supercomputer

doi:10.19678/j.issn.1000-3428.0066016

Abstract

Unified Communication X (UCX) is a production-proven, optimized communication framework for modern high-bandwidth, low-latency networks. As the communication middleware of the domestically developed "Songshan" high-performance computing platform, UCX improves the development efficiency of parallel programming models on the InfiniBand (IB) high-speed interconnect, and its performance directly affects the communication capability of upper-layer applications. This paper analyzes and benchmarks the UCX framework on the Songshan supercomputing platform, and in the process identifies limitations of IB adapter communication and unreasonable choices in UCX's transport selection. To address these problems, UCX parameters are tuned according to the network architecture of the Songshan platform so that UCX fits the platform's Socket Direct architecture. At the code level, the transport selection logic of UCX is modified so that, once a shared memory transport has been selected, UCX no longer also selects the network adapter for transmission. This prevents intra-node inter-process communication from competing for Host Channel Adapter (HCA) resources. In addition, the bandwidth setting of the KNEM shared memory transport in UCX is corrected, so that UCX chooses more reasonably between the Cross Memory Attach (CMA) and KNEM shared memory transports. Experimental results show that, compared with the unoptimized library, the optimized UCX reduces latency by up to 80% in allgather collective communication tests across 100 nodes, reduces intra-node alltoall collective communication latency by up to 70%, and reduces gather collective communication latency by up to 45%. The optimized UCX communication library gives parallel programming models and applications on the Songshan supercomputing platform better interconnect network support and clearly improves the collective communication performance of the platform.

Key words: UCX framework, high performance computing, collective communication, InfiniBand (IB) protocol, shared memory, Message Passing Interface (MPI), high-speed network
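The transport selection change described in the abstract can be illustrated with a small sketch. It is not the UCX source code and does not use any UCX API: the `Transport` class, the `select_lanes` function, the `skip_nic_after_shm` flag, and all bandwidth numbers are hypothetical and exist only for illustration. The sketch assumes that transports are ranked by an advertised bandwidth score, and that in the optimized variant a network transport is skipped once a shared memory transport has already been chosen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str            # label only, e.g. "cma", "knem", "rc_mlx5"
    shared_memory: bool  # True for intra-node shared memory transports
    bandwidth: float     # advertised bandwidth score (made-up values)


def select_lanes(candidates, same_node, max_lanes=2, skip_nic_after_shm=True):
    """Pick up to max_lanes transports in descending bandwidth order.

    Hypothetical model of the selection logic: when skip_nic_after_shm
    is True, a network transport is not added once a shared memory
    transport has already been selected.
    """
    # Shared memory transports are only usable between processes on one node.
    usable = [t for t in candidates if same_node or not t.shared_memory]
    ranked = sorted(usable, key=lambda t: t.bandwidth, reverse=True)

    lanes = []
    for t in ranked:
        if len(lanes) == max_lanes:
            break
        shm_chosen = any(lane.shared_memory for lane in lanes)
        if skip_nic_after_shm and shm_chosen and not t.shared_memory:
            continue  # keep intra-node traffic off the network adapter
        lanes.append(t)
    return [t.name for t in lanes]


if __name__ == "__main__":
    candidates = [
        Transport("cma", True, 11000.0),
        Transport("knem", True, 13000.0),
        Transport("rc_mlx5", False, 12000.0),
    ]
    print(select_lanes(candidates, same_node=True, skip_nic_after_shm=False))
    print(select_lanes(candidates, same_node=True, skip_nic_after_shm=True))
    print(select_lanes(candidates, same_node=False))
```

Because the ranking in this sketch depends only on the advertised bandwidth score, an inaccurate score for one shared memory transport changes which of CMA and KNEM is preferred, which is the kind of effect the corrected KNEM bandwidth setting is meant to avoid.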

Kang LIU, Wei WAN, Bo LIU, Junhong LI, Zhu LI. Analysis and Optimization of UCX Based on "Songshan" Supercomputer[J]. Computer Engineering, 2023, 49(12): 274-281.

http://www.ecice06.com/CN/Y2023/V49/I12/274

Figures/Tables (14)

Fig.1 InfiniBand software stack and UCX

Fig.2 Comparison of cross-Die transmission

Fig.3 UCX software stack structure

Fig.4 Optimized program flow

Fig.5 Intra-node alltoall test results

Fig.6 Intra-node point-to-point bandwidth test results

Fig.7 Two shared memory communication mechanisms

Fig.8 Intra-node gather test results

Fig.9 Intra-node alltoall test results before and after optimization

Fig.10 Intra-node gather test results before and after optimization

Fig.11 Intra-node allreduce test results before and after optimization

Fig.12 32-node allgather test results

Fig.13 100-node allgather test results

