Adaptive Collective Communication Optimization for Domestic Accelerator Platforms

doi:10.19678/j.issn.1000-3428.0260188

Abstract

Abstract: To address the poor adaptability of static strategies, the explosion of the strategy space, and performance jitter in collective communication on domestic general-purpose GPU (GPGPU) platforms, this paper proposes an offline method for automatically tuning collective communication and for optimizing and consolidating communication strategies on domestic heterogeneous computing platforms. The method builds a multidimensional performance space model over communication primitives, message sizes, and node scales, and collects performance data through systematic offline benchmarking. To mitigate system noise in heterogeneous environments, a filtering mechanism based on comparison against the default strategy and significance thresholding is designed: the default strategy serves as the baseline for evaluating performance differences, and statistical analysis then identifies the communication strategy combinations with significant performance advantages. Furthermore, an interval-based strategy model maps discrete sampling points onto continuous message-size ranges, and the optimized strategy mapping is embedded into the internal decision logic of the RCCL communication library. Experimental results on domestic heterogeneous clusters show that the method selects strategies automatically and accurately without introducing additional runtime overhead. Compared with the default strategies, the average bandwidth utilization of Reduce and AllReduce operations improves by 22.4% and 24%, respectively. By relying on offline tuning and strategy consolidation, the approach avoids the overhead and instability of dynamic search and offers an efficient, practical way to improve communication efficiency and system stability in large-scale distributed training systems.
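The pipeline outlined in the abstract (compare each candidate against the default strategy, keep only significant wins, then turn the sampled message sizes into continuous ranges) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The strategy name `tree_ll`, the 5% gain threshold, the noise rule (the gap must exceed the summed standard deviations), the bandwidth figures, and the convention that each interval starts at a sample point and extends to the next one are all assumptions made for illustration; the abstract does not specify them.

```python
from bisect import bisect_right
from statistics import mean, stdev


def significant_gain(baseline, candidate, threshold=0.05):
    """Return the relative gain over the baseline, or None if not significant.

    Assumed rule: the mean gain must reach `threshold` AND the absolute gap
    must exceed the summed sample standard deviations (a crude noise guard).
    """
    b, c = mean(baseline), mean(candidate)
    gain = (c - b) / b
    noise = stdev(baseline) + stdev(candidate)
    if gain >= threshold and (c - b) > noise:
        return gain
    return None


def pick_strategies(samples, default="default", threshold=0.05):
    """Map each sampled message size to its best significant strategy.

    `samples` has the shape {msg_size: {strategy_name: [bandwidth runs]}}.
    The default strategy is kept unless a candidate clearly beats it.
    """
    chosen = {}
    for size, runs in samples.items():
        best, best_gain = default, 0.0
        for name, values in runs.items():
            if name == default:
                continue
            gain = significant_gain(runs[default], values, threshold)
            if gain is not None and gain > best_gain:
                best, best_gain = name, gain
        chosen[size] = best
    return chosen


def build_intervals(chosen):
    """Merge neighbouring sample points that share a strategy.

    Returns a sorted list of (interval_start, strategy) pairs.
    """
    intervals = []
    for size in sorted(chosen):
        if not intervals or intervals[-1][1] != chosen[size]:
            intervals.append((size, chosen[size]))
    return intervals


def lookup(intervals, msg_size, default="default"):
    """Return the strategy whose interval contains `msg_size`."""
    starts = [start for start, _ in intervals]
    i = bisect_right(starts, msg_size) - 1
    return intervals[i][1] if i >= 0 else default


# Hypothetical offline measurements (bandwidth in GB/s, three runs each).
samples = {
    1024: {"default": [1.0, 1.1, 0.9], "tree_ll": [1.5, 1.6, 1.4]},
    4096: {"default": [2.0, 2.1, 1.9], "tree_ll": [2.6, 2.5, 2.7]},
    65536: {"default": [8.0, 8.2, 7.8], "tree_ll": [8.1, 7.9, 8.3]},
}
chosen = pick_strategies(samples)
intervals = build_intervals(chosen)
print(intervals)
print(lookup(intervals, 2048), lookup(intervals, 1 << 20))
```

Because the table is built offline, the runtime cost in this sketch reduces to one binary search over a short sorted list, which is in the spirit of the abstract's claim that consolidation avoids dynamic search; how the mapping is actually encoded inside RCCL is not described in the abstract.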

摘要： 针对国产通用图形处理器（GPGPU）平台集合通信中静态策略适应性差、策略规模膨胀及性能抖动等问题，提出一种面向国产异构算力平台的离线集合通信自动调优与通信策略优化及固化方法。该方法通过对通信原语、消息规模及节点规模构建多维性能空间模型，并结合系统化离线基准测试获取性能数据。在此基础上，为降低异构环境下系统噪声的影响，设计了一种基于默认策略性能对比与显著性阈值判定的筛选机制，先以默认策略为基准进行性能差异评估，再通过统计分析识别具备显著性能优势的通信策略组合，从而实现集合通信过程中的通信策略优化。进一步地，构建基于消息规模区间的策略模型，将离散采样点映射为连续区间，并将优化后的策略映射逻辑集成至RCCL通信库内部决策模块中。实验结果表明，在国产异构集群环境下，该方法无需引入额外运行时开销即可实现通信策略的自动匹配。相较默认策略，规约（Reduce）与全规约（AllReduce）的带宽利用率平均提升分别达到22.4%和24%。该方法通过离线调优与策略固化，有效规避动态搜索带来的开销与稳定性问题，为大规模分布式训练系统提供了一种高效且可工程化的通信优化方案。

WANG Han, LI Shen, DU Xiawei, SHU Yanjun, HU Chen, YU Guo, LIU Yuhai. Adaptive Collective Communication Optimization for Domestic Accelerator Platforms[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0260188.

王晗, 李燊, 杜夏威, 舒燕君, 胡辰, 余果, 刘玉海. 面向国产加速器的集合通信自适应优化研究[J]. 计算机工程, doi: 10.19678/j.issn.1000-3428.0260188.


URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0260188

