
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 274-281. doi: 10.19678/j.issn.1000-3428.0066016

• Development Research and Engineering Application •

Analysis and Optimization of UCX Based on "Songshan" Supercomputer

Kang LIU, Wei WAN*, Bo LIU, Junhong LI, Zhu LI

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China
  • Received: 2022-10-18 Online: 2023-12-15 Published: 2023-12-14
  • Contact: Wei WAN
  • About the authors:

    Kang LIU (born 1998), male, master's student; main research interests: high-performance computing and high-speed networks

    Bo LIU, master's student

    Junhong LI, engineer

    Zhu LI, engineer

Abstract:

Unified Communication X (UCX) is a production-proven, optimized communication framework for modern high-bandwidth, low-latency networks. As the communication middleware of the "Songshan" domestic high-performance computing platform, UCX improves the development efficiency of parallel programming models on the InfiniBand (IB) high-speed interconnect, and its performance directly affects the communication capability of upper-layer applications. This paper analyzes and performance-tests the UCX framework on the Songshan supercomputing platform, and in the process identifies limitations of IB adapter communication and unreasonable behavior in UCX's transport selection. To address these problems, UCX is tuned at the parameter level according to the platform's network architecture so that it fits the Socket Direct architecture of Songshan. At the code level, the transport selection logic is modified so that UCX no longer selects the network adapter once a shared-memory transport has been selected, which stops intra-node inter-process communication from competing for Host Channel Adapter (HCA) resources. The bandwidth setting of the KNEM shared-memory transport in UCX is also corrected, making the choice between the CMA and KNEM shared-memory transports more reasonable. Experimental results show that, compared with the unoptimized version, the optimized UCX reduces latency by up to 80% in allgather collective communication tests across 100 nodes, by up to 70% for intra-node alltoall, and by up to 45% for gather. The optimized UCX library gives parallel programming models and applications on the Songshan platform better interconnect support and markedly improves the platform's collective communication performance.

Key words: UCX framework, high-performance computing, collective communication, InfiniBand (IB) protocol, shared memory, Message Passing Interface (MPI), high-speed network
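
The transport-selection change summarized in the abstract can be illustrated with a small model. The Python sketch below is not UCX code and does not reproduce the authors' patch: the `Transport` class, the `select_lanes` function, and every bandwidth figure are hypothetical and chosen only for illustration. It assumes that transports are ranked by an advertised bandwidth value, and shows two effects: for a peer on the same node, network adapters are dropped as soon as a shared-memory transport is available, and the order of CMA and KNEM depends entirely on the bandwidth value each one advertises.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    name: str         # illustrative label, not a real UCX identifier
    kind: str         # "shm" for shared memory, "net" for a network adapter
    bandwidth: float  # advertised bandwidth in MB/s (made-up numbers)


def select_lanes(candidates, same_node, max_lanes=2):
    """Return up to max_lanes transports, highest advertised bandwidth first.

    Shared-memory transports are only usable when the peer is on the same
    node. If at least one of them is usable, network transports are excluded
    so that intra-node traffic does not occupy the adapter.
    """
    usable = [t for t in candidates if same_node or t.kind != "shm"]
    if same_node and any(t.kind == "shm" for t in usable):
        usable = [t for t in usable if t.kind == "shm"]
    ranked = sorted(usable, key=lambda t: t.bandwidth, reverse=True)
    return ranked[:max_lanes]


# Hypothetical candidate set for one endpoint.
candidates = [
    Transport("cma", "shm", 11000.0),
    Transport("knem", "shm", 13000.0),
    Transport("ib", "net", 12000.0),
]

intra_node = [t.name for t in select_lanes(candidates, same_node=True)]
inter_node = [t.name for t in select_lanes(candidates, same_node=False)]
```

In this model, changing the bandwidth value advertised for `knem` is enough to swap its position relative to `cma`, which is why an inaccurate setting can lead to a poor choice between the two.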