[1] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in neural information processing systems. Red Hook, NY, USA: Curran Associates, Inc, 2017: 5998-6008.
[2] MICIKEVICIUS P, NARANG S, ALBEN J, et al. Mixed precision training[C]//Proceedings of the International Conference on Learning Representations (ICLR 2018). Vancouver, Canada: ICLR, 2018.
[3] HE X, SUN J, CHEN H, et al. Campo: Cost-aware performance optimization for mixed-precision neural network training[C]//2022 USENIX Annual Technical Conference (USENIX ATC 22). Carlsbad, CA, USA: USENIX Association, 2022: 505-518.
[4] KALAMKAR D, MUDIGERE D, MELLEMPUDI N, et al. A study of BFLOAT16 for deep learning training[EB/OL]. [2026-3-24]. https://arxiv.org/abs/1905.12322.
[5] PEREZ S P, ZHANG Y, BRIGGS J, et al. Training and inference of large language models using 8-bit floating point[EB/OL]. [2026-3-24]. https://arxiv.org/abs/2309.17224.
[6] HAN R, DEMMEL J, YOU Y. Auto-precision scaling for distributed deep learning[C]//International Conference on High Performance Computing. Cham, Switzerland: Springer, 2021: 79-97.
[7] LUTZ D R, SAINI A, KROES M, et al. Fused fp8 4-way dot product with scaling and fp32 accumulation[C]//2024 IEEE 31st Symposium on Computer Arithmetic (ARITH). Piscataway, NJ, USA: IEEE, 2024: 40-47.
[8] NARAYAN S, GUPTA A, PAUL M, et al. μnit Scaling: Simple and Scalable FP8 LLM Training[C]//ICML 2025. Honolulu, Hawaii, USA: PMLR, 2025: 45720-45736.
[9] ZHANG Y, ZHEN H-L, YUAN M, et al. MOSS: Efficient and Accurate FP8 LLM Training with Microscaling and Automatic Scaling[EB/OL]. [2026-3-24]. https://arxiv.org/pdf/2511.05811.
[10] LIANG G, SHAO J, TANG N, et al. TWEO: Transformers Without Extreme Outliers Enables FP8 Training And Quantization For Dummies[EB/OL]. [2026-3-24]. https://arxiv.org/pdf/2511.23225.
[11] XI H, CHEN Y, ZHAO K, et al. Jetfire: Efficient and accurate transformer pretraining with int8 data flow and per-block quantization[C]//Proceedings of the 41st International Conference on Machine Learning. Vienna, Austria: PMLR, 2024: 54049-54063.
[12] ROCK A, UNTETHER A, KHALIL O, et al. Int8 transformers for inference acceleration[C]//36th Conference on Neural Information Processing Systems (NeurIPS). San Diego, California, USA: NeurIPS, 2022.
[13] DETTMERS T, LEWIS M, BELKADA Y, et al. LLM.int8(): 8-bit matrix multiplication for transformers at scale[EB/OL]. [2026-3-24]. https://arxiv.org/abs/2208.07339.
[14] ZHOU Q, GUO S, QU Z, et al. Octo: INT8 training with loss-aware compensation and backward quantization for tiny on-device learning[C]//2021 USENIX Annual Technical Conference (USENIX ATC 21). USA: USENIX Association, 2021: 177-191.
[15] ZHU F, GONG R, YU F, et al. Towards unified int8 training for convolutional neural network[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Seattle, Washington, USA: IEEE / CVF, 2020: 1969-1979.
[16] ZHAO K, HUANG S, PAN P, et al. Distribution adaptive int8 quantization for training cnns[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Virtually hosted / Los Angeles, California, USA: Association for the Advancement of Artificial Intelligence (AAAI), 2021: 3483-3491.
[17] ZHANG P, WEI J, ZHANG J, et al. Accurate int8 training through dynamic block-level fallback[EB/OL]. [2026-3-24]. https://arxiv.org/pdf/2503.08040.
[18] SUN X, CHOI J, CHEN C-Y, et al. Hybrid 8-bit floating point (HFP8) training and inference for deep neural networks[C]//Advances in neural information processing systems. Vancouver, British Columbia, Canada: Curran Associates, Inc. / NeurIPS Foundation, 2019.
[19] YANG C, ZHANG R Y, HUANG L, et al. A survey of quantization methods for deep neural networks[J]. Chinese Journal of Engineering, 2023, 45(10): 1613-1629. (in Chinese)
[20] JACOB B, KLIGYS S, CHEN B, et al. Quantization and training of neural networks for efficient integer-arithmetic-only inference[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. Salt Lake City, Utah, USA: IEEE / Computer Vision Foundation (CVF), 2018: 2704-2713.
[21] HE H T, ZHOU B, GUO S Z, et al. Automatic mixed-precision optimization for matrix multiplication computation[J]. Computer Science, 2024, 51(S2): 766-775. (in Chinese)
[22] ZHU Y. Research on large-scale training optimization of Transformer for domestic heterogeneous accelerator[D]. Zhengzhou: Zhengzhou University, 2023. (in Chinese)
[23] MILIC U, VILLA O, BOLOTIN E, et al. Beyond the socket: NUMA-aware GPUs[C]//Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. Cambridge, Massachusetts, USA: ACM / IEEE, 2017: 123-135.
[24] YOUNG V, JALEEL A, BOLOTIN E, et al. Combining HW/SW mechanisms to improve NUMA performance of multi-GPU systems[C]//2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Taipei, Taiwan: IEEE, 2018: 339-351.
[25] SHOEYBI M, PATWARY M, PURI R, et al. Megatron-lm: Training multi-billion parameter language models using model parallelism[EB/OL]. [2026-3-24]. https://arxiv.org/pdf/1909.08053.
[26] TOUVRON H, MARTIN L, STONE K, et al. Llama 2: Open foundation and fine-tuned chat models[EB/OL]. [2026-3-24]. https://arxiv.org/abs/2307.09288.
[27] GRATTAFIORI A, DUBEY A, JAUHRI A, et al. The llama 3 herd of models[EB/OL]. [2026-3-24]. https://arxiv.org/pdf/2407.21783.
[28] YANG A, LI A, YANG B, et al. Qwen3 technical report[EB/OL]. [2026-3-24]. https://arxiv.org/abs/2505.09388.
[29] JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts[EB/OL]. [2026-4-21]. https://arxiv.org/pdf/2401.04088.
[30] KIM J, SEO M, NGUYEN X T. Mixed INT4-INT8 LLM Quantization via Progressive Layerwise Assignment with Dynamic Sensitivity Estimation[C]//2025 IEEE International Symposium on Circuits and Systems (ISCAS). London, United Kingdom: IEEE, 2025: 1-5.
[31] PENG H, WU K, WEI Y, et al. Fp8-lm: Training fp8 large language models[EB/OL]. [2026-3-24]. https://arxiv.org/abs/2310.18313.