[1] SHOEYBI M, PATWARY M, PURI R, et al. Megatron-LM: Training multi-billion parameter language models using model parallelism[EB/OL]. [2019-09-17]. https://arxiv.org/abs/1909.08053.
[2] RAJBHANDARI S, RASLEY J, RUWASE O, et al. ZeRO: memory optimizations toward training trillion parameter models[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Atlanta, Georgia: IEEE Press, 2020: Article 20.
[3] ZHENG L, LI Z, ZHANG H, et al. Alpa: Automating inter- and intra-operator parallelism for distributed deep learning[C]//16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22). Carlsbad, CA, USA: USENIX Association, 2022: 559-578.
[4] MENG Yugong. Communication optimization strategy of distributed deep learning based on sample importance [J]. Modern Electronics Technique, 2025, 48(13): 77-82. (in Chinese)
[5] ZHU Hongrui, YUAN Guojun, YAO Chengji, et al. Review of distributed deep learning training networks [J]. Journal of Computer Research and Development, 2021, 58(01): 98-115. (in Chinese)
[6] SERGEEV A, DEL BALSO M. Horovod: fast and easy distributed deep learning in TensorFlow[EB/OL]. [2018-02-15]. https://arxiv.org/abs/1802.05799.
[7] WEINGRAM A, LI Y, QI H, et al. xCCL: A survey of industry-led collective communication libraries for deep learning [J]. Journal of Computer Science and Technology, 2023, 38(1): 166-195.
[8] JIANG Z, LIN H, ZHONG Y, et al. MegaScale: Scaling Large Language Model Training to More Than 10,000 GPUs[C]//21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24). Santa Clara, CA, USA: USENIX Association, 2024: 745-760.
[9] LI Baojia, HE Chunzhi, XIA Yinben, et al. Star-pulse network: collaborative optimization of collective communication and centralized routing for GPU clusters [J]. ZTE Technology, 2025, 31(02): 3-13. (in Chinese)
[10] WANG G, QIN H, JACOBS S A, et al. ZeRO++: Extremely efficient collective communication for giant model training[C]//Proceedings of the Twelfth International Conference on Learning Representations (ICLR 2024). Vienna, Austria: OpenReview, 2024: 1-15.
[11] CHANG L-W, BAO W, HOU Q, et al. Flux: Fast software-based communication overlap on GPUs through kernel fusion[EB/OL]. [2024-06-11]. https://arxiv.org/abs/2406.06858.
[12] CHEN C, LI X, ZHU Q, et al. Centauri: Enabling Efficient Scheduling for Communication-Computation Overlap in Large Model Training via Communication Partitioning[C]//Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. La Jolla, CA, USA: Association for Computing Machinery, 2024: 178-191.
[13] DAI L, GONG L, AN Z, et al. Sketch-fusion: A gradient compression method with multi-layer fusion for communication-efficient distributed training [J]. Journal of Parallel and Distributed Computing, 2024, 185(1): 104811.
[14] WANG Z, ZHOU Y, TIAN C, et al. AFNFA: An Approach to Automate NCCL Configuration Exploration[C]//Proceedings of the 7th Asia-Pacific Workshop on Networking. Hong Kong, China: Association for Computing Machinery, 2023: 204-205.
[15] NVIDIA. NVIDIA Collective Communication Library (NCCL)[EB/OL]. [2026-01-08]. https://github.com/NVIDIA/nccl.
[16] AMD. ROCm Communication Collectives Library (RCCL)[EB/OL]. [2026-01-08]. https://github.com/ROCm/rccl.
[17] XU G, LE Z, CHEN Y, et al. AutoCCL: Automated Collective Communication Tuning for Accelerating Distributed and Parallel DNN Training[C]//22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25). Philadelphia, PA: USENIX Association, 2025: 667-683.
[18] QU Zhiyong, WANG Xiaoguang, ZHOU Chunbao, et al. Optimization methods for large model training on domestic supercomputing systems [J]. Frontiers of Data & Computing, 2025, 7(02): 120-129. (in Chinese)
[19] BROWN T, MANN B, RYDER N, et al. Language models are few-shot learners [J]. Advances in Neural Information Processing Systems, 2020, 33(1): 1877-1901.
[20] NARAYANAN D, SHOEYBI M, CASPER J, et al. Efficient large-scale language model training on GPU clusters using Megatron-LM[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. St. Louis, Missouri: Association for Computing Machinery, 2021: 1-15.
[21] BEN-NUN T, HOEFLER T. Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis [J]. ACM Comput Surv, 2019, 52(4): Article 65.
[22] HU Z, SHEN S, BONATO T, et al. Demystifying NCCL: An In-Depth Analysis of GPU Communication Protocols and Algorithms[C]//Proceedings of the IEEE Symposium on High-Performance Interconnects. Washington, D.C., USA: IEEE Press: 48-59.
[23] ZHAO X, ZHANG Z, WU C. AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive[C]//2024 IEEE 44th International Conference on Distributed Computing Systems (ICDCS). Jersey City, New Jersey, USA: IEEE, 2024: 25-35.
[24] COWAN M, MALEKI S, MUSUVATHI M, et al. GC3: An optimizing compiler for GPU collective communication[EB/OL]. [2022-01-27]. https://arxiv.org/abs/2201.11840.
[25] SHAH A, CHIDAMBARAM V, COWAN M, et al. TACCL: Guiding collective algorithm synthesis using communication sketches[C]//20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). Boston, MA, USA: USENIX Association, 2023: 593-612.
[26] COWAN M, MALEKI S, MUSUVATHI M, et al. MSCCLang: Microsoft collective communication language[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2. New York, NY, USA: ACM, 2023: 502-514.
[27] WON W, HEO T, RASHIDI S, et al. ASTRA-sim2.0: Modeling hierarchical networks and disaggregated systems for large-model training at scale[C]//2023 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). Raleigh, NC, USA: IEEE, 2023: 283-294.
[28] AMD. rccl-tests: RCCL collective communication benchmark suite[EB/OL]. [2026-01-08]. https://github.com/ROCm/rccl-tests.
[29] HOEFLER T, BELLI R. Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Austin, Texas, USA: Association for Computing Machinery, 2015: Article 73.
[30] XU X L Y, ZHANG Z, et al. Reliable and resilient collective communication library for LLM training and serving[EB/OL]. [2025-12-31]. https://arxiv.org/abs/2512.25059.
[31] NVIDIA. NVLink: A High-Speed Interconnect for GPUs[EB/OL]. [2026-01-08]. https://www.nvidia.com/en-us/data-center/nvlink/.
[32] JOUPPI N P, YOUNG C, PATIL N, et al. In-Datacenter Performance Analysis of a Tensor Processing Unit[C]//Proceedings of the 44th Annual International Symposium on Computer Architecture. Toronto, Ontario, Canada: Association for Computing Machinery, 2017: 1-12.
[33] KALIA A, KAMINSKY M, ANDERSEN D G. Datacenter RPCs can be general and fast[C]//Proceedings of the 16th USENIX Conference on Networked Systems Design and Implementation. Boston, MA, USA: USENIX Association, 2019: 1-16.
[34] VERBRAEKEN J, WOLTING M, KATZY J, et al. A Survey on Distributed Machine Learning [J]. ACM Comput Surv, 2020, 53(2): Article 30.
[35] LI X W Y, ZHANG H, et al. HetCCL: Accelerating LLM training with heterogeneous GPUs[EB/OL]. [2026-01-30]. https://arxiv.org/abs/2601.22585.