Computer Engineering (计算机工程)


Parallel solution and optimization of large-scale sparse linear system in GRAPES dynamic framework

  • Published: 2020-12-28


Abstract: One of the computational cores of the dynamic framework of the GRAPES (Global/Regional Assimilation and Prediction System) numerical weather prediction system is the Helmholtz equation, which can essentially be reduced to the solution of a large-scale sparse linear system. Owing to limited hardware resources and the continuous growth of data volumes, solver efficiency has gradually become a bottleneck that limits overall computing performance. To address this problem, the generalized conjugate residual (GCR) method for solving the linear system is implemented with three parallel schemes: MPI, MPI+OpenMP, and CUDA. The experiments first apply an ILU preconditioner to improve the condition number of the coefficient matrix and thereby accelerate the convergence of the iterative method. In the CPU parallel scheme, MPI handles coarse-grained parallelism and communication between processes, while OpenMP uses shared memory to achieve fine-grained parallelism within each process; in the GPU parallel scheme, the CUDA implementation applies optimizations for data transfer, coalesced memory access, and shared memory. Finally, the results are verified for accuracy and analyzed for performance. Experimental results show that the MPI+OpenMP hybrid parallel optimization performs about 35% better than the MPI parallel optimization, and the CUDA parallel optimization performs about 50% better than the MPI+OpenMP hybrid parallel optimization.
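For context, the generalized conjugate residual (GCR) method named in the abstract has the following standard preconditioned textbook form; the notation below is the usual one and is not taken from the paper itself. With $M \approx A$ the ILU preconditioner, $x_0$ an initial guess, $r_0 = b - Ax_0$ and $p_0 = M^{-1}r_0$, iterate for $j = 0, 1, 2, \dots$ until $\lVert r_{j+1} \rVert$ is sufficiently small:

$$\alpha_j = \frac{(r_j,\,Ap_j)}{(Ap_j,\,Ap_j)}, \qquad x_{j+1} = x_j + \alpha_j p_j, \qquad r_{j+1} = r_j - \alpha_j Ap_j,$$

$$\beta_{ij} = -\frac{(AM^{-1}r_{j+1},\,Ap_i)}{(Ap_i,\,Ap_i)} \quad (0 \le i \le j), \qquad p_{j+1} = M^{-1}r_{j+1} + \sum_{i=0}^{j} \beta_{ij}\,p_i.$$

Every step is built from sparse matrix-vector products, dot products, vector updates, and the preconditioner application, which is presumably what the three parallel schemes in the abstract target.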
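The following is a minimal sketch of the CPU-side hybrid scheme described in the abstract, assuming a row-wise CSR partition of the matrix across MPI ranks; the names (csr_block, local_csr_spmv, dist_dot) are illustrative and not taken from the paper. MPI provides the coarse-grained, inter-process level, while OpenMP parallelizes the loops inside each rank.

#include <mpi.h>
#include <omp.h>

/* Local block of a row-partitioned CSR matrix: this rank owns n_local rows.
   The vector x is assumed to already contain any halo entries needed from
   neighboring ranks (exchanged beforehand).                                */
typedef struct {
    int           n_local;   /* number of rows owned by this rank          */
    const int    *row_ptr;   /* length n_local + 1                         */
    const int    *col_idx;   /* column indices of the nonzeros             */
    const double *val;       /* nonzero values                             */
} csr_block;

/* y = A_local * x : OpenMP gives fine-grained parallelism over local rows. */
static void local_csr_spmv(const csr_block *A, const double *x, double *y)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < A->n_local; ++i) {
        double sum = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            sum += A->val[k] * x[A->col_idx[k]];
        y[i] = sum;
    }
}

/* Distributed dot product: thread-level reduction inside the rank,
   then a global reduction across ranks with MPI_Allreduce.                */
static double dist_dot(const double *u, const double *v, int n_local)
{
    double local = 0.0, global = 0.0;
    #pragma omp parallel for reduction(+:local) schedule(static)
    for (int i = 0; i < n_local; ++i)
        local += u[i] * v[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    return global;
}

Inside GCR, these two kernels plus the vector updates dominate the runtime, so the MPI/OpenMP split sketched above is the natural place to apply the hybrid parallelization the abstract describes.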
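The abstract names coalesced memory access and shared memory as the main CUDA-side optimizations. The kernel below is a generic illustration of that pattern, a warp-per-row CSR sparse matrix-vector product with a shared-memory reduction, not the authors' actual kernel; all identifiers are illustrative.

// One warp (32 threads) per matrix row: consecutive lanes read consecutive
// nonzeros, so loads from val[] and col_idx[] are coalesced; the 32 partial
// sums of each warp are then reduced through shared memory.
__global__ void spmv_csr_vector(int n_rows,
                                const int    *row_ptr,
                                const int    *col_idx,
                                const double *val,
                                const double *x,
                                double       *y)
{
    extern __shared__ double sdata[];              // one slot per thread
    const int tid  = blockIdx.x * blockDim.x + threadIdx.x;
    const int lane = threadIdx.x & 31;             // lane index within the warp
    const int row  = tid >> 5;                     // one warp per row

    double sum = 0.0;
    if (row < n_rows) {
        for (int k = row_ptr[row] + lane; k < row_ptr[row + 1]; k += 32)
            sum += val[k] * x[col_idx[k]];         // coalesced across the warp
    }

    sdata[threadIdx.x] = sum;
    __syncthreads();

    // Tree reduction of each warp's partial sums in shared memory.
    for (int offset = 16; offset > 0; offset >>= 1) {
        if (lane < offset)
            sdata[threadIdx.x] += sdata[threadIdx.x + offset];
        __syncthreads();
    }

    if (row < n_rows && lane == 0)
        y[row] = sdata[threadIdx.x];
}

// Launch example (hypothetical sizes):
//   int threads = 128;                                   // 4 warps per block
//   int blocks  = (n_rows * 32 + threads - 1) / threads;
//   spmv_csr_vector<<<blocks, threads, threads * sizeof(double)>>>(
//       n_rows, d_row_ptr, d_col_idx, d_val, d_x, d_y);

For the data-transfer optimization mentioned in the abstract, a common approach (and only an assumption here) is to keep the matrix and all GCR work vectors resident in device memory across iterations and copy only the converged solution back to the host, which minimizes host-device traffic.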