HPL-MxP Multiple lookahead Optimization for Kunpeng Processors

Computer Engineering, 2025, Vol. 51, Issue (8): 354-363. doi: 10.19678/j.issn.1000-3428.0068758

GAO Ang¹,², WANG Yinshan¹,²,*, YAN Wen¹,², SONG Changcheng³, WANG Long³, YAO Erlin¹,²

1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China
2. University of Chinese Academy of Sciences, Beijing 101408, China
3. Huawei Technologies Co., Ltd, Hangzhou 310052, Zhejiang, China

Received: 2023-11-03 Revised: 2023-12-25 Online: 2025-08-15 Published: 2025-08-15
Corresponding author: WANG Yinshan
Funding:
Chinese Academy of Sciences Youth Innovation Promotion Fund (E345060)

Abstract:

The HPL-MxP benchmark is widely used to measure the mixed-precision computing capability of supercomputers. Because of the parallel algorithm the program implements, choosing the matrix block size (NB) is a trade-off between matrix multiplication efficiency and load balancing. To address this problem, this paper presents an optimization study on the Kunpeng 920 system and proposes a multi-level lookahead strategy: a small NB is used for matrix blocking to achieve better load balancing, while multiple rounds of trailing-matrix updates are merged to raise the equivalent NB, so that good load balancing and high matrix multiplication efficiency are obtained together. To implement the multi-level lookahead scheme, the Panel storage mode is restructured, a fine-grained computation and communication pipeline is designed, and the interfaces of the HPL-MxP source program are extended. Mixed single- and double-precision tests on a multi-node Kunpeng 920 platform show that multi-level lookahead effectively resolves the NB trade-off in HPL-MxP and incurs no significant additional overhead compared with the single-level lookahead strategy.

Key words: HPL-MxP benchmark, matrix blocking, mixed precision, multi-level lookahead optimization strategy, Panel storage mode
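
The core idea in the abstract, merging several rounds of trailing-matrix updates so that a small NB behaves like a larger equivalent NB, rests on a simple algebraic identity: k successive updates of width NB give the same result as one update of width k × NB. The sketch below illustrates only that identity on small integer matrices. It is not the paper's implementation; all function names (`update_one_by_one`, `update_merged`, `equivalent_nb`, `make_matrix`) are hypothetical, and it assumes the L and U panels are already available, ignoring the dependencies between lookahead panels, the Panel storage layout, and the communication pipeline.

```python
# Illustrative sketch only: shows that k rank-NB trailing updates equal one
# rank-(k*NB) update. Names and data are assumptions, not taken from the paper.

def matmul(a, b):
    """Plain triple-loop product of two list-of-lists matrices."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def subtract(a, b):
    """Element-wise a - b, returning a new matrix."""
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def update_one_by_one(trailing, l_panels, u_panels):
    """Apply one update per panel: A <- A - L_i * U_i (inner dimension NB)."""
    for l, u in zip(l_panels, u_panels):
        trailing = subtract(trailing, matmul(l, u))
    return trailing


def update_merged(trailing, l_panels, u_panels):
    """Apply a single update A <- A - [L_1 ... L_k] * [U_1; ...; U_k]."""
    # Place the L panels side by side, giving an m x (k*NB) matrix.
    l_wide = [sum((l[i] for l in l_panels), []) for i in range(len(trailing))]
    # Stack the U panels vertically, giving a (k*NB) x n matrix.
    u_tall = [row for u in u_panels for row in u]
    return subtract(trailing, matmul(l_wide, u_tall))


def equivalent_nb(nb, depth):
    """Inner dimension seen by the merged matrix multiplication."""
    return nb * depth


def make_matrix(rows, cols, seed):
    """Deterministic small-integer test matrix (exact arithmetic, no rounding)."""
    return [[(seed + 3 * i + 5 * j) % 7 - 3 for j in range(cols)]
            for i in range(rows)]


if __name__ == "__main__":
    nb, depth, m, n = 2, 3, 4, 5
    trailing = make_matrix(m, n, 1)
    l_panels = [make_matrix(m, nb, 10 + i) for i in range(depth)]
    u_panels = [make_matrix(nb, n, 20 + i) for i in range(depth)]
    same = update_one_by_one(trailing, l_panels, u_panels) == \
        update_merged(trailing, l_panels, u_panels)
    print("merged update matches:", same)
    print("equivalent NB:", equivalent_nb(nb, depth))
```

In this picture the distribution granularity stays at NB, which is what governs load balance, while the matrix multiplication runs with the larger inner dimension, which is what governs its efficiency.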

GAO Ang, WANG Yinshan, YAN Wen, SONG Changcheng, WANG Long, YAO Erlin. HPL-MxP Multiple lookahead Optimization for Kunpeng Processors[J]. Computer Engineering, 2025, 51(8): 354-363.

https://www.ecice06.com/CN/Y2025/V51/I8/354

Figures/Tables (11)

Fig.1 LU recursive factorization

Fig.2 Dense matrix multiplication efficiency versus matrix size for different NB values

Fig.3 HPL-MxP computing efficiency versus NB value

Fig.4 3-level lookahead

Fig.5 Panel storage mode in the source program

Fig.6 Schematic of matrix partitioning with 3-level lookahead

Fig.7 Schematic of matrix storage with 3-level lookahead

Fig.8 Schematic of the 3-level lookahead pipeline

Fig.9 Computational efficiency of HPL-MxP under different equivalent NB values

Fig.10 Comparison of computational efficiency when the equivalent NB value is 2048

Fig.11 Efficiency comparison of single-level and multi-level lookahead strategies


