
计算机工程 (Computer Engineering), 2022, Vol. 48, Issue (1): 149-154, 162. doi: 10.19678/j.issn.1000-3428.0060080

• Advanced Computing and Data Processing •

Parallel Solution and Optimization of Large-Scale Sparse Linear System in GRAPES Dynamic Framework

ZHANG Kun1, JIA Jinfang1, YAN Wenxin1, HUANG Jianqiang1,2, WANG Xiaoying1

  1. Department of Computer Technology and Applications, Qinghai University, Xining 810016, China;
  2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2020-11-23  Revised: 2021-01-17  Published: 2020-12-28
  • About the authors: ZHANG Kun (1997-), male, M.S. candidate, whose main research interest is high-performance computing; JIA Jinfang (corresponding author), lecturer, M.S.; YAN Wenxin, M.S. candidate; HUANG Jianqiang, associate professor and Ph.D. candidate; WANG Xiaoying, professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61762074, 62062059); Qinghai Provincial Science and Technology Program (2019-ZJ-7034); "Chunhui Plan" Research Fund of the Ministry of Education (QDCH2018001).


Abstract: Solving the Helmholtz equation is the core of the dynamic framework of the Global and Regional Assimilation Prediction System (GRAPES) for numerical weather forecasting. The equation can be transformed into the solution of a large-scale sparse linear system, but, limited by hardware resources and data scale, its solution efficiency becomes a bottleneck for improving the computing performance of the system. This paper implements the Generalized Conjugate Residual (GCR) method for solving large-scale sparse linear equations with three parallel approaches (MPI, MPI+OpenMP and CUDA), and uses an Incomplete LU (ILU) preconditioner to improve the condition number of the coefficient matrix and accelerate the convergence of the iterative method. In the CPU parallel scheme, MPI is responsible for coarse-grained parallelism and communication between processes, while OpenMP uses shared memory to achieve fine-grained parallelism within each process. In the GPU parallel scheme, the CUDA implementation applies optimizations in data transfer, coalesced memory access and shared memory. Experimental results show that reducing the number of iterations through preconditioning clearly improves computing performance; MPI+OpenMP hybrid parallel optimization performs about 35% better than MPI parallel optimization, and CUDA parallel optimization performs about 50% better than the MPI+OpenMP hybrid version, achieving the best performance.
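
As a rough illustration of the hybrid CPU scheme described above, the sketch below (not the authors' code; the CSR storage format, the function names and the omission of halo exchange are assumptions made for the example) shows the two kernels a parallel GCR solver spends most of its time in: a sparse matrix-vector product parallelized with OpenMP threads inside each MPI process, and a global inner product whose coarse-grained reduction across processes uses MPI_Allreduce.

    /*
     * Minimal sketch of the hybrid MPI+OpenMP building blocks of a GCR solver.
     * Illustrative only: CSR storage, the function names and the assumption
     * that x already holds the halo values needed by the local rows are all
     * simplifications, not details taken from the GRAPES implementation.
     */
    #include <mpi.h>
    #include <omp.h>

    /* y = A * x for the locally owned rows of A, stored in CSR format;
       OpenMP provides the fine-grained, shared-memory parallelism. */
    void csr_spmv(int n_local, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n_local; ++i) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col_idx[j]];
            y[i] = sum;
        }
    }

    /* Global inner product: thread-level reduction inside each process,
       then a coarse-grained reduction across MPI processes. */
    double dot_global(int n_local, const double *a, const double *b, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n_local; ++i)
            local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

A translation unit like this would be built with an MPI compiler wrapper plus OpenMP support (e.g. mpicc -fopenmp). A GPU version would map the same kernels onto CUDA thread blocks, which is where the data-transfer, coalesced-access and shared-memory optimizations mentioned in the abstract apply.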

Key words: sparse linear system, Generalized Conjugate Residual (GCR) method, Message Passing Interface (MPI), OpenMP programming, Compute Unified Device Architecture (CUDA)
