
Computer Engineering

   

Adaptive Collective Communication Optimization for Domestic Accelerator Platforms

  

  • Published: 2026-05-29

面向国产加速器的集合通信自适应优化研究

Abstract: Collective communication on domestic general-purpose GPU (GPGPU) platforms suffers from static strategies that adapt poorly, an exploding strategy space, and performance jitter. To address these issues, this paper proposes a method for offline automatic tuning, strategy optimization, and strategy consolidation for collective communication on domestic heterogeneous computing platforms. The method constructs a multidimensional performance space model over communication primitives, message sizes, and node scales, and collects performance data through systematic offline benchmarking. To mitigate the impact of system noise in heterogeneous environments, a filtering mechanism based on comparison against the default strategy and significance thresholding is designed: the default strategy first serves as the baseline for evaluating performance differences, and statistical analysis then identifies the communication strategy combinations with a significant performance advantage. Furthermore, an interval-based strategy model maps discrete sampling points onto continuous message-size ranges, and the optimized strategy mapping is embedded into the internal decision logic of the RCCL communication library. Experimental results on domestic heterogeneous clusters show that the method selects strategies automatically and accurately without introducing additional runtime overhead. Compared with the default strategies, the average bandwidth utilization of Reduce and AllReduce operations improves by 22.4% and 24%, respectively. By relying on offline tuning and strategy consolidation, the approach avoids the overhead and instability caused by dynamic search, and provides an efficient and practical solution for improving communication efficiency and system stability in large-scale distributed training systems.
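The two core steps described in the abstract, threshold-based filtering against the default strategy and consolidation of sampled points into message-size intervals, can be illustrated with a minimal Python sketch. This is not the paper's implementation: the function names, the strategy labels ("tree", "ring"), the 5% relative-gain threshold, and the use of a simple mean comparison in place of the paper's statistical analysis are all assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import mean

DEFAULT = "default"  # hypothetical label for the library's built-in strategy


def pick_strategy(samples, threshold=0.05):
    """Choose a strategy for one message size.

    samples maps a strategy name to a list of measured bandwidths.
    A non-default strategy is accepted only if its mean bandwidth beats
    the default's mean by more than `threshold` (relative gain), which
    is an assumed stand-in for the significance test in the paper.
    """
    base = mean(samples[DEFAULT])
    best, best_bw = DEFAULT, base
    for name, runs in samples.items():
        bw = mean(runs)
        if bw > best_bw:
            best, best_bw = name, bw
    gain = (best_bw - base) / base
    return best if gain > threshold else DEFAULT


def build_intervals(size_to_strategy):
    """Merge adjacent sampled sizes that share a strategy.

    Returns a sorted list of (lower_bound, strategy); each entry covers
    message sizes from lower_bound up to the next entry's lower bound.
    """
    intervals = []
    for size in sorted(size_to_strategy):
        strategy = size_to_strategy[size]
        if not intervals or intervals[-1][1] != strategy:
            intervals.append((size, strategy))
    return intervals


def lookup(intervals, msg_size):
    """Return the strategy for msg_size; fall back to the default
    below the smallest sampled size."""
    bounds = [lower for lower, _ in intervals]
    i = bisect_right(bounds, msg_size) - 1
    return intervals[i][1] if i >= 0 else DEFAULT


# Made-up benchmark data: message size in bytes -> strategy -> bandwidths.
bench = {
    1024: {"default": [1.0, 1.1, 0.9], "tree": [1.5, 1.6, 1.4]},
    4096: {"default": [2.0, 2.0, 2.0], "tree": [2.6, 2.5, 2.7]},
    65536: {"default": [8.0, 8.2, 7.8], "tree": [8.1, 8.0, 8.2]},
    1048576: {"default": [20.0, 20.0, 20.0], "ring": [26.0, 25.0, 27.0]},
}

chosen = {size: pick_strategy(runs) for size, runs in bench.items()}
table = build_intervals(chosen)
```

Because the interval table is built offline, a runtime lookup such as `lookup(table, 2048)` reduces to a binary search over a few bounds, which is consistent with the abstract's claim that no dynamic search is needed at run time. In the sketch, the marginal gain at 65536 bytes falls below the threshold, so that size keeps the default strategy.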
