Parallel Compilation Optimization Method for Sunway High Performance Multi-Core Processors

doi:10.19678/j.issn.1000-3428.0062139

Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 130-138. doi: 10.19678/j.issn.1000-3428.0062139


ZHOU Yonghao¹, XU Jinlong², LI Bin¹, QIAN Hong³, NIE Kai²

1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China;
2. State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001, China;
3. Jiangnan Institute of Computing Technology, Wuxi, Jiangsu 214083, China

Received: 2021-07-20 Revised: 2021-11-07 Published: 2021-11-11
About the authors: ZHOU Yonghao (born 2000), male, undergraduate student, main research interest: advanced compilation techniques; XU Jinlong, lecturer, Ph.D.; LI Bin, associate professor, Ph.D.; QIAN Hong, senior engineer, M.S.; NIE Kai, Ph.D. candidate.
Funding:
National Key Research and Development Program of China, "High Performance Computing" Key Special Project (2016YFB0200503).

Abstract

Abstract: On Sunway high-performance multi-core servers, the automatic parallelizing compilation system identifies and declares the parallelism in a program, but the OpenMP programs it generates are not fully optimized: they use a simple fork-join model and contain a large number of nested parallel loops, which leads to low running efficiency. To improve the running efficiency of these OpenMP programs, this paper proposes a parallel region reconstruction optimization technique. By merging parallel regions in the program and extending the scope of parallel regions in nested loops, the technique reduces the number of parallel regions, lowers control overhead such as the frequent creation and merging of thread teams, and transforms OpenMP programs that use the simple fork-join model into OpenMP programs that use the more efficient Single Program Multiple Data (SPMD) model. Experimental results show that on the SW1621 platform, a new-generation Sunway high-performance multi-core server, parallel region reconstruction improves running efficiency on the NPB3.3-OMP and SPEC OMP2012 benchmark suites by 10.77% and 7.94%, respectively, and can effectively improve the execution efficiency of OpenMP programs generated by the automatic parallelizing compilation system.

Key words: Sunway high performance multi-core processors, OpenMP programming, parallel region reconstruction, fork-join model, Single Program Multiple Data (SPMD) model
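
The two transformations named in the abstract, merging adjacent parallel regions and extending a parallel region outward over an enclosing loop, can be illustrated with a small hand-written sketch. This is not the compiler's actual output or the authors' implementation: the kernels and all function names below are hypothetical, chosen only to show the before and after shape of the code. The sketch uses OpenMP pragmas only, so it also compiles (and runs serially) without OpenMP enabled.

```c
#include <assert.h>
#include <stddef.h>

/* Fork-join style (hypothetical kernel): each loop opens its own
   parallel region, so the thread team is forked and joined twice. */
void scale_then_shift_fork_join(double *a, double *b, long n) {
    assert(a != NULL && b != NULL);
    #pragma omp parallel for
    for (long i = 0; i < n; i++) a[i] = 2.0 * a[i];

    #pragma omp parallel for
    for (long i = 0; i < n; i++) b[i] = a[i] + 1.0;
}

/* Merged regions: one parallel region encloses both worksharing loops.
   The implicit barrier at the end of the first "omp for" preserves the
   dependence of the second loop on a[]. */
void scale_then_shift_merged(double *a, double *b, long n) {
    assert(a != NULL && b != NULL);
    #pragma omp parallel
    {
        #pragma omp for
        for (long i = 0; i < n; i++) a[i] = 2.0 * a[i];

        #pragma omp for
        for (long i = 0; i < n; i++) b[i] = a[i] + 1.0;
    }
}

/* Fork-join style inside a sequential outer loop: 2 * steps regions. */
void smooth_fork_join(double *x, double *tmp, long n, int steps) {
    assert(x != NULL && tmp != NULL);
    for (int t = 0; t < steps; t++) {
        #pragma omp parallel for
        for (long i = 1; i < n - 1; i++)
            tmp[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0;

        #pragma omp parallel for
        for (long i = 1; i < n - 1; i++) x[i] = tmp[i];
    }
}

/* Extended region (SPMD style): the parallel region is hoisted outside
   the outer loop. Every thread executes the outer loop itself (t is
   private because it is declared inside the region) and the inner
   loops are shared out by "omp for". Only one region is created. */
void smooth_spmd(double *x, double *tmp, long n, int steps) {
    assert(x != NULL && tmp != NULL);
    #pragma omp parallel
    {
        for (int t = 0; t < steps; t++) {
            #pragma omp for
            for (long i = 1; i < n - 1; i++)
                tmp[i] = (x[i - 1] + x[i] + x[i + 1]) / 3.0;

            #pragma omp for
            for (long i = 1; i < n - 1; i++) x[i] = tmp[i];
        }
    }
}
```

In both pairs the per-element arithmetic is unchanged, so the restructured versions should produce the same arrays as the fork-join versions; only the number of parallel regions, and therefore the number of thread team creations and joins, differs.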

CLC number: TP391

ZHOU Yonghao, XU Jinlong, LI Bin, QIAN Hong, NIE Kai. Parallel Compilation Optimization Method for Sunway High Performance Multi-Core Processors[J]. Computer Engineering, 2022, 48(9): 130-138.

https://www.ecice06.com/CN/Y2022/V48/I9/130

Figures/Tables: 10

