
Computer Engineering

   

Method for INT8 Quantized Training in Transformer Engine on Domestic AI Accelerators

  

  • Published: 2026-06-05


Abstract: Low-precision training of large language models helps reduce training cost and improve hardware utilization. However, most existing efficient low-precision training frameworks rely on native FP8 hardware support, which makes them difficult to migrate directly to domestic AI accelerators that lack FP8 execution capability. The key challenge is therefore to reconstruct a low-precision training path for domestic accelerators that does not depend on dedicated FP8 hardware units, yet still maintains training stability and delivers practical end-to-end performance gains. To address this challenge, this paper proposes an efficient Transformer Engine training scheme based on INT8 dynamic quantization. The scheme redesigns the original FP8 linear-layer computation flow around the integer matrix multiplication capability already available on domestic accelerators, enabling low-precision pretraining of large language models without dedicated FP8 hardware support. In terms of method design, the scheme preserves the dynamic scaling management principle of Transformer Engine and restructures the FP8-dependent linear-layer computation flow into a cross-precision execution path consisting of dynamic quantization, INT8 matrix multiplication, INT32 accumulation, and fused dequantization. This design maps the most computation-intensive matrix multiplications onto the underlying integer compute units. To balance implementation feasibility and execution efficiency, a tensorwise dynamic quantization strategy is adopted, in which activations and weights are scaled online according to the dynamic range of each tensor. Combined with the native support of domestic SIMT accelerators for INT8×INT8 integer matrix multiplication with INT32 accumulation, this design allows the core linear-layer operators of Transformer Engine to be adapted to domestic hardware.
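The execution path described above (dynamic quantization, INT8 matrix multiplication, INT32 accumulation, dequantization) can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the function names `quantize_tensorwise` and `int8_linear`, the symmetric range of ±127, and the max-absolute-value scaling rule are all assumptions made for illustration, and the dequantization here is a separate step rather than a fused kernel.

```python
import numpy as np

INT8_MAX = 127  # assumed symmetric range [-127, 127]; not stated in the abstract


def quantize_tensorwise(x):
    """Quantize a float tensor to INT8 using one scale for the whole tensor."""
    amax = float(np.max(np.abs(x)))  # dynamic range of this tensor
    scale = amax / INT8_MAX if amax > 0.0 else 1.0
    q = np.clip(np.rint(x / scale), -INT8_MAX, INT8_MAX).astype(np.int8)
    return q, np.float32(scale)


def int8_linear(x, w):
    """Approximate y = x @ w.T with INT8 operands and INT32 accumulation."""
    x_q, s_x = quantize_tensorwise(x)  # activations, scaled online
    w_q, s_w = quantize_tensorwise(w)  # weights, scaled online
    # Widen to int32 so the products are accumulated in INT32,
    # mimicking an INT8 x INT8 -> INT32 matrix multiplication unit.
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T
    # Dequantize: one multiplication by the product of the two scales.
    return acc.astype(np.float32) * (s_x * s_w)


rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)   # hypothetical activations
w = rng.standard_normal((32, 64)).astype(np.float32)  # hypothetical weights

y_ref = x @ w.T
y_int8 = int8_linear(x, w)
rel_err = float(np.linalg.norm(y_int8 - y_ref) / np.linalg.norm(y_ref))
print(f"relative error vs. float32 reference: {rel_err:.4f}")
```

Because a single scale covers the whole tensor, one outlier value widens the quantization step for every element. That is the usual cost of tensorwise granularity, which trades some accuracy for a simpler implementation.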
Furthermore, under uniform INT8 quantization, numerically sensitive modules such as the input embedding layer and the output layer are prone to mismatched activation and gradient scales, amplified quantization error, and degraded convergence. This paper analyzes the numerical characteristics of these layers from the perspectives of gradient propagation and error propagation, and accordingly proposes a hierarchical precision quantization strategy. Specifically, the input embedding layer and output layer remain in BF16 precision to ensure stable gradient propagation and reliable parameter updates; computation-intensive intermediate modules, including attention projection layers and feed-forward networks, adopt the INT8 low-precision path to fully exploit the throughput of integer compute units; and scaling factors and some critical intermediate variables are kept in higher precision to balance numerical stability and practical acceleration. On this basis, the scheme is integrated into the Megatron-LM distributed training framework and validated through multi-model pretraining experiments on domestic accelerators. The experiments evaluate Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L, the last of which is an 8-layer pruned version of the Mixtral-8x7B architecture. The proposed INT8 scheme is compared with the BF16 baseline under the same number of training iterations. The results show that, across all models, the training loss curves of the proposed method stay close to those of the BF16 baseline, without obvious oscillation, divergence, or convergence stagnation, indicating that the reconstructed INT8 training path preserves convergence stability during large-model pretraining.
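The hierarchical precision strategy above amounts to a per-module precision assignment, which the following sketch expresses as a simple lookup. The module-name fragments, the `select_precision` helper, and the rule that unlisted modules fall back to BF16 are hypothetical choices for illustration; the abstract does not specify how the assignment is configured.

```python
# Hypothetical module-name fragments; real frameworks may name modules differently.
BF16_MODULES = ("embedding", "output_layer")        # numerically sensitive layers
INT8_MODULES = ("attention.qkv_proj", "attention.out_proj",
                "mlp.fc1", "mlp.fc2")               # compute-intensive linear layers


def select_precision(module_name: str) -> str:
    """Return the compute precision assigned to a module by the layered policy."""
    if any(key in module_name for key in BF16_MODULES):
        return "bf16"
    if any(key in module_name for key in INT8_MODULES):
        return "int8"
    # Assumption: anything not explicitly listed stays in higher precision.
    return "bf16"


layers = [
    "embedding.word_embeddings",
    "decoder.layers.0.attention.qkv_proj",
    "decoder.layers.0.attention.out_proj",
    "decoder.layers.0.mlp.fc1",
    "decoder.layers.0.mlp.fc2",
    "decoder.layers.0.input_layernorm",
    "output_layer",
]
plan = {name: select_precision(name) for name in layers}
for name, precision in plan.items():
    print(f"{name:40s} -> {precision}")
```

Checking the BF16 list first means a sensitive layer stays in higher precision even if its name also matches an INT8 pattern.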
In terms of end-to-end training efficiency, the speedups over the BF16 baseline for Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L are 1.21×, 1.16×, 1.17×, 1.07×, 1.20×, and 1.12×, respectively, demonstrating stable efficiency gains across models of different scales and architectures. Overall, the proposed method reconstructs the low-precision training path of Transformer Engine on domestic accelerators that lack native FP8 hardware support. Through the coordinated design of dynamic quantization, an INT8 computation path, and a hierarchical precision quantization strategy, the method achieves stable end-to-end acceleration while maintaining convergence stability. The experimental results indicate that, under current hardware conditions, software-level reconstruction of the computation path combined with model-structure-aware precision configuration can unlock the potential of integer compute units, providing a practical solution for efficient pretraining of large language models on domestic platforms.
