Zhao Chao, Wen Jin Hui, Yu Guo, Zhao Yan Nan, Du Xia Wei, Hu Chen, Liu Wei, Yin Ze Ming, Liu Yu Hai
Accepted: 2026-06-05
Low-precision training of large language models reduces training cost and improves hardware utilization. However, existing high-efficiency low-precision training frameworks mostly rely on native FP8 hardware support, which makes them difficult to migrate directly to domestic AI accelerators that lack FP8 execution capability. The key challenge is therefore to reconstruct a low-precision training path suitable for domestic accelerators without relying on dedicated FP8 hardware units, while still maintaining training stability and achieving practical end-to-end performance gains. To address this issue, this paper proposes an efficient Transformer Engine training scheme for domestic hardware based on INT8 dynamic quantization. The scheme redesigns the original FP8 linear-layer computation flow around the integer matrix multiplication capability already available on domestic accelerators, thereby enabling low-precision pretraining of large language models without dedicated FP8 hardware support.
In terms of method design, the proposed scheme preserves the dynamic scaling management principle of Transformer Engine and reconstructs the original FP8-dependent linear-layer computation flow into a cross-precision execution path consisting of dynamic quantization, INT8 matrix multiplication, INT32 accumulation, and fused dequantization. This design maps the most computation-intensive matrix multiplication operations onto the underlying integer compute units. To balance implementation feasibility and execution efficiency, a tensorwise dynamic quantization strategy is adopted, in which activations and weights are scaled online according to the dynamic range of each tensor. Combined with the native support of domestic SIMT accelerators for INT8×INT8 integer matrix multiplication with INT32 accumulation, this design enables the core linear-layer operators of Transformer Engine to be adapted to domestic hardware.
Under uniform INT8 quantization, numerically sensitive modules such as the input embedding layer and the output layer are prone to activation–gradient scale mismatch, quantization error amplification, and convergence degradation. This paper analyzes the numerical characteristics of these layers from the perspectives of gradient propagation and error propagation, and accordingly proposes a hierarchical precision quantization strategy. Specifically, the input embedding layer and output layer remain in BF16 precision to ensure stable gradient propagation and reliable parameter updates; computation-intensive intermediate modules, including attention projection layers and feed-forward networks, adopt the INT8 low-precision path to fully exploit the throughput of integer compute units; and scaling factors and some critical intermediate variables are retained in higher precision to balance numerical stability and practical acceleration.
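As an illustration only, the execution path described above (tensorwise dynamic quantization, INT8 multiplication, integer accumulation, dequantization) and the layer-wise precision assignment might be sketched in plain Python as follows. This is a minimal sketch under stated assumptions, not the paper's implementation: the symmetric ±127 range, the function names, and the module-kind labels in `PRECISION_PLAN` are all hypothetical, Python integers stand in for the INT32 accumulator, and real kernels would fuse these steps on the accelerator.

```python
def quantize_tensorwise(x, qmax=127):
    """Symmetric tensorwise quantization: one scale for the whole tensor.

    Assumption: the scale is derived from the tensor's current absolute
    maximum, which is one common way to realize online dynamic scaling.
    """
    amax = max(abs(v) for row in x for v in row)
    scale = amax / qmax if amax > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in x]
    return q, scale


def int8_linear(x, w):
    """Compute y = x @ w^T through the quantized path.

    x: list of rows (batch x in_features), w: list of rows (out x in_features).
    """
    xq, sx = quantize_tensorwise(x)  # dynamic quantization of activations
    wq, sw = quantize_tensorwise(w)  # dynamic quantization of weights
    # Integer multiply-accumulate; Python ints stand in for INT32 here.
    acc = [[sum(a * b for a, b in zip(xr, wr)) for wr in wq] for xr in xq]
    # Dequantization: rescale the integer result by both scales.
    return [[v * sx * sw for v in row] for row in acc]


# Hypothetical module-kind labels illustrating the hierarchical strategy:
# numerically sensitive layers stay in BF16, compute-heavy layers use INT8.
PRECISION_PLAN = {
    "embedding": "bf16",
    "output": "bf16",
    "attention_proj": "int8",
    "ffn": "int8",
}


def precision_for(module_kind):
    """Return the assigned precision, defaulting to BF16 for unknown kinds."""
    return PRECISION_PLAN.get(module_kind, "bf16")
```

For small inputs, the result of `int8_linear` stays close to the full-precision product; the residual error comes from rounding each tensor to 8 bits with a single shared scale, which is why a tensor with a few large outliers loses resolution for its small values.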
On this basis, the proposed scheme is integrated into the Megatron-LM distributed training framework and validated through multi-model pretraining experiments on domestic accelerators.
The experiments evaluate Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L, the last of which is an 8-layer pruned version based on the Mixtral-8x7B architecture. The proposed INT8 scheme is compared with the BF16 baseline under the same number of training iterations. The results show that, across all models, the training loss curves of the proposed method closely track those of the BF16 baseline, without obvious oscillation, divergence, or convergence stagnation, indicating that the reconstructed INT8 training path preserves convergence stability during large-model pretraining. In terms of end-to-end training efficiency, the speedups over the BF16 baseline for Llama2-7B, Llama2-13B, Llama3.1-8B, Qwen3-4B, Qwen3-8B, and Mixtral-8x7B-8L are 1.21×, 1.16×, 1.17×, 1.07×, 1.20×, and 1.12×, respectively, demonstrating consistent efficiency gains across models of different scales and architectures.
Overall, the proposed method effectively reconstructs the low-precision training path of Transformer Engine on domestic accelerators without native FP8 hardware support. Through the coordinated design of dynamic quantization, an INT8 computation path, and a hierarchical precision quantization strategy, the method achieves stable end-to-end acceleration while maintaining convergence stability. The experimental results indicate that, under current hardware conditions, software-level computation-path reconstruction combined with model-structure-aware precision configuration can effectively unlock the potential of integer compute units, providing a practical solution for efficient pretraining of large language models on domestic platforms.