
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 35-47. doi: 10.19678/j.issn.1000-3428.0068234

• Hot Topics and Reviews •

Auto-Generation and Auto-Tuning Framework of Stencil Operation Code

LIU Jinshuo, WEN Yao

  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China
  • Received: 2023-08-16  Revised: 2023-11-10  Published: 2024-06-11
  • Corresponding author: WEN Yao, E-mail: 188006400@qq.com
  • Supported by: National Key Research and Development Program of China (2020YFA0607900).

Abstract: To address the limitations of existing stencil code generation methods, such as the lack of multi-Graphics Processing Unit (GPU) support and insufficient tuning, this study proposes a framework for the automatic generation and tuning of stencil code described in a Domain-Specific Language (DSL). In the code generation stage, the framework automatically parses the high-level description, constructs a computational graph, and generates Compute Unified Device Architecture (CUDA) kernel functions for the stencil operation; it also produces different host-side code depending on whether a single-GPU or multi-GPU environment is targeted. In the tuning stage, candidate parameter ranges are determined according to the GPU model, and the generated CUDA kernel functions are invoked dynamically to determine the optimal parameters. In the multi-GPU case, the automatically generated host-side code exchanges boundary data by overlapping computation with communication. Across four different GPUs and 7-, 13-, 19-, and 27-point stencil operations, the framework finds the optimal parameter configuration. Experimental results on the Tesla V100-SXM2 show that, with the tuned parameters, the framework achieves 1.230, 1.680, 1.120, and 1.480 Trillion Floating-point Operations per Second (TFLOPS) for the four stencil operations in single precision, and 0.690, 1.010, 0.480, and 1.470 TFLOPS in double precision, reaching on average 98% of the performance of hand-optimized code while offering a simpler description and multi-GPU extensibility.
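
To make the generation step concrete, below is a minimal sketch of the kind of CUDA kernel such a framework emits for a 3D 7-point stencil. It is not the paper's generated code: the kernel name stencil7, the coefficients c0/c1, and the one-cell halo handling are illustrative assumptions.

    // Minimal 3D 7-point stencil kernel sketch; names and coefficients are
    // illustrative assumptions, not the framework's actual output.
    __global__ void stencil7(const float* in, float* out,
                             int nx, int ny, int nz, float c0, float c1) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
            return;  // leave the one-cell halo untouched
        long idx = (long)k * ny * nx + (long)j * nx + i;
        long sxy = (long)nx * ny;  // stride between z-planes
        out[idx] = c0 * in[idx]
                 + c1 * (in[idx - 1]   + in[idx + 1]      // x neighbors
                       + in[idx - nx]  + in[idx + nx]     // y neighbors
                       + in[idx - sxy] + in[idx + sxy]);  // z neighbors
    }

A 13-, 19-, or 27-point variant differs only in the neighbor offsets and coefficients, which is what makes this family of kernels amenable to automatic generation from a DSL description.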

Key words: stencil operation, Compute Unified Device Architecture (CUDA), computational graph, Domain-Specific Language (DSL), code generation, automatic tuning
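
The tuning stage described in the abstract can be pictured as a timed sweep over candidate launch configurations. The following is a hedged sketch that reuses the stencil7 kernel sketched above and an illustrative candidate list of thread-block shapes; a real tuner would derive the candidates from the GPU model and average repeated runs.

    #include <cuda_runtime.h>

    __global__ void stencil7(const float*, float*, int, int, int, float, float);  // sketched above

    // Time each candidate block shape with CUDA events and keep the fastest.
    dim3 tuneBlockShape(const float* d_in, float* d_out, int nx, int ny, int nz) {
        const dim3 candidates[] = { dim3(32, 4, 4), dim3(32, 8, 2),
                                    dim3(64, 4, 2), dim3(128, 2, 2) };
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float bestMs = 1e30f;
        dim3 best = candidates[0];
        for (const dim3& b : candidates) {
            dim3 g((nx + b.x - 1) / b.x, (ny + b.y - 1) / b.y, (nz + b.z - 1) / b.z);
            cudaEventRecord(start);
            stencil7<<<g, b>>>(d_in, d_out, nx, ny, nz, 0.4f, 0.1f);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < bestMs) { bestMs = ms; best = b; }
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return best;
    }

Because the candidate ranges depend on the GPU model, such a search would be repeated for each of the four GPUs evaluated in the paper.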
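
For the multi-GPU case, "overlapping computation with communication" means launching the interior update while halo planes travel between devices. The sketch below assumes two peer-connected GPUs and hypothetical stencilInterior/stencilBoundary kernels and device buffers; it illustrates the pattern, not the paper's generated host code.

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the framework's generated code.
    __global__ void stencilInterior(const float* in, float* out, int nx, int ny, int nz);
    __global__ void stencilBoundary(const float* in, float* out, const float* halo, int nx, int ny);

    // One time step on GPU 0: compute the interior on one stream while the
    // boundary planes (nx*ny values each) are exchanged with GPU 1 on another.
    void stepWithOverlap(float* d_in0, float* d_out0, float* d_top0, float* d_halo0,
                         float* d_bottom1, float* d_halo1, int nx, int ny, int nz,
                         dim3 gridInner, dim3 gridEdge, dim3 block) {
        cudaSetDevice(0);
        cudaStream_t compute, comm;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&comm);

        // Interior points need no remote data, so they start immediately.
        stencilInterior<<<gridInner, block, 0, compute>>>(d_in0, d_out0, nx, ny, nz);

        // Meanwhile, trade halo planes peer-to-peer (peer access enabled
        // beforehand with cudaDeviceEnablePeerAccess).
        size_t plane = (size_t)nx * ny * sizeof(float);
        cudaMemcpyPeerAsync(d_halo1, 1, d_top0, 0, plane, comm);    // GPU0 -> GPU1
        cudaMemcpyPeerAsync(d_halo0, 0, d_bottom1, 1, plane, comm); // GPU1 -> GPU0

        // The boundary slab waits for the incoming halo, then runs.
        cudaStreamSynchronize(comm);
        stencilBoundary<<<gridEdge, block, 0, compute>>>(d_in0, d_out0, d_halo0, nx, ny);
        cudaStreamSynchronize(compute);
        cudaStreamDestroy(compute);
        cudaStreamDestroy(comm);
    }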
