
计算机工程 (Computer Engineering), 2022, Vol. 48, Issue (1): 149-154, 162. doi: 10.19678/j.issn.1000-3428.0060080

• Advanced Computing and Data Processing •

Parallel Solution and Optimization of Large-Scale Sparse Linear System in GRAPES Dynamic Framework

ZHANG Kun1, JIA Jinfang1, YAN Wenxin1, HUANG Jianqiang1,2, WANG Xiaoying1

  1. Department of Computer Technology and Applications, Qinghai University, Xining 810016, China;
  2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2020-11-23  Revised: 2021-01-17  Published: 2020-12-28
  • About the authors: ZHANG Kun (1997-), male, M.S. candidate, whose main research interest is high-performance computing; JIA Jinfang (corresponding author), lecturer, M.S.; YAN Wenxin, M.S. candidate; HUANG Jianqiang, associate professor and Ph.D. candidate; WANG Xiaoying, professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61762074, 62062059); Qinghai Provincial Science and Technology Program (2019-ZJ-7034); "Chunhui Plan" Research Fund of the Ministry of Education (QDCH2018001).


Abstract: Solving the Helmholtz equation is the core of the dynamic framework of the Global and Regional Assimilation Prediction System (GRAPES) for numerical weather forecasting. The equation can be transformed into the solution of a large-scale sparse linear system, but, limited by hardware resources and data scale, its solution efficiency becomes a bottleneck for improving the computing performance of the system. This paper implements the Generalized Conjugate Residual (GCR) method for solving large-scale sparse linear equations with three parallel approaches (MPI, MPI+OpenMP and CUDA), and uses an Incomplete LU (ILU) preconditioner to improve the condition number of the coefficient matrix and accelerate the convergence of the iterative method. In the CPU parallel scheme, MPI is responsible for coarse-grained parallelism and communication between processes, while OpenMP uses shared memory to achieve fine-grained parallelism within each process. In the GPU parallel scheme, the CUDA implementation applies optimizations in data transfer, coalesced memory access and shared memory. Experimental results show that reducing the number of iterations through preconditioning clearly improves computing performance; MPI+OpenMP hybrid parallel optimization performs about 35% better than MPI parallel optimization, and CUDA parallel optimization performs about 50% better than the MPI+OpenMP hybrid version, achieving the best performance.
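
As a rough illustration of the hybrid CPU scheme described above, the sketch below (not the authors' code; the CSR storage format, the function names and the omission of halo exchange are assumptions made for the example) shows the two kernels a parallel GCR solver spends most of its time in: a sparse matrix-vector product parallelized with OpenMP threads inside each MPI process, and a global inner product whose coarse-grained reduction across processes uses MPI_Allreduce.

    /*
     * Minimal sketch of the hybrid MPI+OpenMP building blocks of a GCR solver.
     * Illustrative only: CSR storage, the function names and the assumption
     * that x already holds the halo values needed by the local rows are all
     * simplifications, not details taken from the GRAPES implementation.
     */
    #include <mpi.h>
    #include <omp.h>

    /* y = A * x for the locally owned rows of A, stored in CSR format;
       OpenMP provides the fine-grained, shared-memory parallelism. */
    void csr_spmv(int n_local, const int *row_ptr, const int *col_idx,
                  const double *val, const double *x, double *y)
    {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n_local; ++i) {
            double sum = 0.0;
            for (int j = row_ptr[i]; j < row_ptr[i + 1]; ++j)
                sum += val[j] * x[col_idx[j]];
            y[i] = sum;
        }
    }

    /* Global inner product: thread-level reduction inside each process,
       then a coarse-grained reduction across MPI processes. */
    double dot_global(int n_local, const double *a, const double *b, MPI_Comm comm)
    {
        double local = 0.0, global = 0.0;
        #pragma omp parallel for reduction(+:local)
        for (int i = 0; i < n_local; ++i)
            local += a[i] * b[i];
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
        return global;
    }

A translation unit like this would be built with an MPI compiler wrapper plus OpenMP support (e.g. mpicc -fopenmp). A GPU version would map the same kernels onto CUDA thread blocks, which is where the data-transfer, coalesced-access and shared-memory optimizations mentioned in the abstract apply.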

Key words: sparse linear system, Generalized Conjugate Residual (GCR) method, Message Passing Interface (MPI), OpenMP programming, Compute Unified Device Architecture (CUDA)
