Unstructured Grid Computing Acceleration Algorithm Based on Sunway TaihuLight

doi:10.19678/j.issn.1000-3428.0065567

Abstract

Abstract: Unstructured grid computing on Sunway TaihuLight, a domestic heterogeneous many-core platform, is characterized by sparse storage, discrete memory access, and data dependency, which severely limit the performance of the many-core processor. To address sparse storage and discrete memory access, this paper proposes an N-order diagonal coloring algorithm that balances computation between the Management Processing Element (MPE) and the Computing Processing Elements (CPEs) and uses the CPEs to convert global memory accesses into Local Device Memory (LDM) accesses. To resolve the computing contention caused by data dependency, an adaptive, dependency-free task partitioning method is adopted to avoid data conflicts during parallel computing. Further optimizations target the processor architecture and the unstructured grid computation: (1) the MPE and CPEs run asynchronously in parallel and are assigned different work so that hardware resources are fully used; (2) the register communication mechanism provided by the processor is not used, which reduces synchronization overhead in the CPE array and eases porting to the next-generation Sunway platform; (3) computation is overlapped asynchronously with memory access to hide memory access latency. Experiments with the SpMV, Integration, and calcLudsFcc operators show that, relative to the MPE implementation, the combined acceleration algorithm achieves an average speedup of about 10 times across different problem sizes and a maximum speedup of 24 times. The N-order diagonal coloring algorithm is more than 5.8 times faster than the non-coloring blocking algorithm, which shows that it effectively improves data locality and computational parallelism. The algorithm also performs well on operators with dependency-induced computing conflicts, which verifies the effectiveness of the adaptive, dependency-free task partitioning method.

Key words: Sunway TaihuLight, unstructured grid, many-core acceleration, discrete memory access, dependency-free task partitioning
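
The data-conflict problem named in the abstract can be illustrated with a generic example. In a face-based integration loop, each face updates its two adjacent cells, so two faces that share a cell cannot safely be processed at the same time. The sketch below is not the paper's adaptive partitioning method or its N-order diagonal coloring algorithm (the abstract gives no details of either). It is a minimal greedy coloring that only shows how a loop can be split into conflict-free groups. The function names `color_faces` and `integrate`, and the representation of faces as (owner, neighbour) cell pairs, are assumptions made for illustration.

```python
from collections import defaultdict


def color_faces(faces):
    """Give each face the smallest color not yet used at either of its cells.

    Faces that receive the same color never share a cell.
    """
    used = defaultdict(set)  # cell index -> colors already taken at that cell
    colors = []
    for owner, neigh in faces:
        c = 0
        while c in used[owner] or c in used[neigh]:
            c += 1
        colors.append(c)
        used[owner].add(c)
        used[neigh].add(c)
    return colors


def integrate(faces, flux, n_cells):
    """Face-based integration processed one color group at a time.

    Each face adds its flux to the owner cell and subtracts it from the
    neighbour cell.
    """
    groups = defaultdict(list)
    for i, c in enumerate(color_faces(faces)):
        groups[c].append(i)

    result = [0.0] * n_cells
    for c in sorted(groups):
        # Within one group no two faces write to the same cell, so this
        # inner loop could be split across workers without locks.
        for i in groups[c]:
            owner, neigh = faces[i]
            result[owner] += flux[i]
            result[neigh] -= flux[i]
    return result


# Hypothetical 4-cell ring mesh, used only to exercise the sketch.
faces = [(0, 1), (1, 2), (2, 3), (0, 3)]
flux = [1.0, 2.0, 3.0, 4.0]
print(color_faces(faces))         # [0, 1, 0, 1]
print(integrate(faces, flux, 4))  # [5.0, 1.0, 1.0, -7.0]
```

Greedy coloring is simple but makes no attempt to balance group sizes or to keep each group's data contiguous. The abstract states that the proposed algorithms are designed to handle load balance between the MPE and CPEs and to improve data locality.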

CLC number: TP311

XU Le, AN Hong, CHEN Junshi, ZHANG Pengfei, WU Zheng. Unstructured Grid Computing Acceleration Algorithm Based on Sunway TaihuLight[J]. Computer Engineering, 2022, 48(12): 45-53.

https://www.ecice06.com/CN/Y2022/V48/I12/45

Figures/Tables: 19

