
Computer Engineering (计算机工程) ›› 2022, Vol. 48 ›› Issue (12): 45-53. doi: 10.19678/j.issn.1000-3428.0065567

• Advanced Computing Technology •

Unstructured Grid Computing Acceleration Algorithm Based on Sunway TaihuLight

XU Le, AN Hong, CHEN Junshi, ZHANG Pengfei, WU Zheng

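  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Received: 2022-08-22 Revised: 2022-09-28 Published: 2022-10-24
  • About the authors: XU Le (born 1997), male, master's student, main research interest: parallel computing; AN Hong (corresponding author), professor, Ph.D., doctoral supervisor; CHEN Junshi, specially appointed associate research fellow, Ph.D.; ZHANG Pengfei, master's student; WU Zheng, Ph.D. student.
  • Funding:
    National Natural Science Foundation of China, "Highly scalable parallel computing framework for smoothed particle hydrodynamics on exascale computing systems" (62102389).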

Unstructured Grid Computing Acceleration Algorithm Based on Sunway TaihuLight

XU Le, AN Hong, CHEN Junshi, ZHANG Pengfei, WU Zheng   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  • Received: 2022-08-22 Revised: 2022-09-28 Published: 2022-10-24

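Abstract: Unstructured grid computing on Sunway TaihuLight, a domestic heterogeneous many-core platform, is characterized by sparse storage, discrete memory access, and data dependence, which severely limit the performance of the many-core processor. To address sparse storage and discrete memory access, an N-order diagonal coloring algorithm is proposed that balances computation between the master core and the slave cores and uses the slave cores to convert global memory accesses into LDM accesses. To address the compute contention caused by data dependence, an adaptive, dependency-free task partitioning method is adopted to avoid data conflicts during parallel computation. To optimize for both the processor architecture and unstructured grid computing, the master core and slave cores run asynchronously in parallel and are used in differentiated roles to make full use of hardware resources; in addition, the register communication mechanism provided by the processor is not used, which lowers the synchronization overhead of the slave-core array and makes the code easier to port to the next-generation Sunway platform. Furthermore, asynchronous overlap of computation and memory access is used to hide memory access latency. Experiments with the SpMV, Integration, and calcLudsFcc operators show that, compared with the master-core implementation, the combined acceleration algorithm achieves an average speedup of 10 times across different problem sizes, with a maximum of 24 times, and that the N-order diagonal coloring algorithm achieves a speedup of more than 5.8 times over the non-coloring blocking algorithm, effectively improving data locality and computational parallelism. The algorithm also accelerates operators with dependency-induced compute conflicts well, which validates the adaptive, dependency-free task partitioning method.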

Keywords: Sunway TaihuLight, unstructured grid, many-core acceleration, discrete memory access, dependency-free task partitioning

Abstract: The performance of unstructured grid computing on Sunway TaihuLight, a domestic heterogeneous many-core platform, is limited by sparse storage, discrete memory access, and data dependence. To relieve the sparse storage and discrete memory access problems, this paper proposes an N-order diagonal coloring algorithm, which balances computation between the Management Processing Element (MPE) and the Computing Processing Elements (CPEs) and uses the CPEs to convert global memory accesses into Local Device Memory (LDM) accesses. To resolve the compute contention caused by data dependence, the paper presents an adaptive and independent task partitioning method that avoids data conflicts in parallel computing. Three further optimizations target the remaining bottlenecks: (1) to make full use of hardware resources, the MPE and CPEs run asynchronously in parallel; (2) to reduce synchronization costs across the CPE array, register communication is avoided, which also eases porting to the next-generation Sunway platform; (3) to hide memory access latency, memory access is overlapped with computation. The SpMV, Integration, and calcLudsFcc operators are used to evaluate the algorithm. The results show that, relative to the MPE implementation, the combined algorithm achieves an average speedup of about 10 times across problem sizes, with a maximum of 24 times. Moreover, the N-order diagonal coloring algorithm achieves a speedup of more than 5.8 times over the non-coloring blocking algorithm, improving data locality and computational parallelism. The algorithm also accelerates operators with dependency-induced conflicts well, which verifies the effectiveness of the adaptive and independent task partitioning method.

Key words: Sunway TaihuLight, unstructured grid, many-core acceleration, discrete memory access, independent task partition
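
The abstract refers to data conflicts that arise when parallel workers update the same mesh cell. As a rough illustration only, the sketch below uses a generic greedy edge coloring, which is a stand-in and not the paper's N-order diagonal coloring or its adaptive partitioning. The function names `color_edges` and `integrate`, the sign convention, and the tiny four-cell mesh are all hypothetical.

```python
from collections import defaultdict

def color_edges(edges):
    """Greedy coloring: edges that touch a common cell never share a color."""
    used = defaultdict(set)  # cell -> colors already taken by its edges
    colors = []
    for a, b in edges:
        c = 0
        while c in used[a] or c in used[b]:
            c += 1
        colors.append(c)
        used[a].add(c)
        used[b].add(c)
    return colors

def integrate(edges, flux, n_cells):
    """Edge-to-cell accumulation, processed one color group at a time."""
    colors = color_edges(edges)
    groups = defaultdict(list)
    for e, c in enumerate(colors):
        groups[c].append(e)
    out = [0.0] * n_cells
    for c in sorted(groups):
        # Edges in one group write to disjoint cells, so this inner loop
        # could be split across workers without locks or atomics.
        for e in groups[c]:
            a, b = edges[e]
            out[a] += flux[e]  # assumed convention: add to the first cell
            out[b] -= flux[e]  # and subtract from the second
    return out

# Hypothetical mesh: 4 cells connected by 5 edges.
edges = [(0, 1), (1, 2), (2, 3), (3, 0), (0, 2)]
flux = [1.0, 2.0, 3.0, 4.0, 5.0]
colors = color_edges(edges)
result = integrate(edges, flux, 4)
```

Within each color group no two edges share a cell, so the updates in that group are independent; the cost is one synchronization point per color, which is why the number of colors and the balance between groups matter in practice.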
