
Computer Engineering ›› 2024, Vol. 50 ›› Issue (6): 35-47. doi: 10.19678/j.issn.1000-3428.0068234

• Hot Topics and Reviews •

Auto-Generation and Auto-Tuning Framework of Stencil Operation Code

LIU Jinshuo, WEN Yao

  1. Key Laboratory of Aerospace Information Security and Trusted Computing, Ministry of Education, School of Cyber Science and Engineering, Wuhan University, Wuhan 430072, Hubei, China
  • Received: 2023-08-16  Revised: 2023-11-10  Published: 2024-06-11
  • Corresponding author: WEN Yao, E-mail: 188006400@qq.com
  • Supported by: National Key Research and Development Program of China (2020YFA0607900).

Abstract: To address the limitations of existing stencil code generation methods, such as the lack of multi-Graphics Processing Unit (GPU) support and insufficient tuning, this study proposes a framework for the automatic generation and tuning of stencil code described in a Domain-Specific Language (DSL). In the code generation stage, the framework automatically parses the high-level description, constructs a computational graph, and generates Compute Unified Device Architecture (CUDA) kernel functions for the stencil operation; it also produces different host-side code depending on whether a single-GPU or multi-GPU environment is targeted. In the tuning stage, candidate parameter ranges are determined according to the GPU model, and the generated CUDA kernel functions are invoked dynamically to determine the optimal parameters. In the multi-GPU case, the automatically generated host-side code exchanges boundary data by overlapping computation with communication. Across four different GPUs and 7-, 13-, 19-, and 27-point stencil operations, the framework finds the optimal parameter configuration. Experimental results on the Tesla V100-SXM2 show that, with the tuned parameters, the framework achieves 1.230, 1.680, 1.120, and 1.480 Trillion Floating-point Operations per Second (TFLOPS) for the four stencil operations in single precision, and 0.690, 1.010, 0.480, and 1.470 TFLOPS in double precision, reaching on average 98% of the performance of hand-optimized code while offering a simpler description and multi-GPU extensibility.
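
To make the generation step concrete, below is a minimal sketch of the kind of CUDA kernel such a framework emits for a 3D 7-point stencil. It is not the paper's generated code: the kernel name stencil7, the coefficients c0/c1, and the one-cell halo handling are illustrative assumptions.

    // Minimal 3D 7-point stencil kernel sketch; names and coefficients are
    // illustrative assumptions, not the framework's actual output.
    __global__ void stencil7(const float* in, float* out,
                             int nx, int ny, int nz, float c0, float c1) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        int k = blockIdx.z * blockDim.z + threadIdx.z;
        if (i < 1 || i >= nx - 1 || j < 1 || j >= ny - 1 || k < 1 || k >= nz - 1)
            return;  // leave the one-cell halo untouched
        long idx = (long)k * ny * nx + (long)j * nx + i;
        long sxy = (long)nx * ny;  // stride between z-planes
        out[idx] = c0 * in[idx]
                 + c1 * (in[idx - 1]   + in[idx + 1]      // x neighbors
                       + in[idx - nx]  + in[idx + nx]     // y neighbors
                       + in[idx - sxy] + in[idx + sxy]);  // z neighbors
    }

A 13-, 19-, or 27-point variant differs only in the neighbor offsets and coefficients, which is what makes this family of kernels amenable to automatic generation from a DSL description.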

Key words: stencil operation, Compute Unified Device Architecture (CUDA), computational graph, Domain-Specific Language (DSL), code generation, automatic tuning
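
The tuning stage described in the abstract can be pictured as a timed sweep over candidate launch configurations. The following is a hedged sketch that reuses the stencil7 kernel sketched above and an illustrative candidate list of thread-block shapes; a real tuner would derive the candidates from the GPU model and average repeated runs.

    #include <cuda_runtime.h>

    __global__ void stencil7(const float*, float*, int, int, int, float, float);  // sketched above

    // Time each candidate block shape with CUDA events and keep the fastest.
    dim3 tuneBlockShape(const float* d_in, float* d_out, int nx, int ny, int nz) {
        const dim3 candidates[] = { dim3(32, 4, 4), dim3(32, 8, 2),
                                    dim3(64, 4, 2), dim3(128, 2, 2) };
        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);
        float bestMs = 1e30f;
        dim3 best = candidates[0];
        for (const dim3& b : candidates) {
            dim3 g((nx + b.x - 1) / b.x, (ny + b.y - 1) / b.y, (nz + b.z - 1) / b.z);
            cudaEventRecord(start);
            stencil7<<<g, b>>>(d_in, d_out, nx, ny, nz, 0.4f, 0.1f);
            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms = 0.0f;
            cudaEventElapsedTime(&ms, start, stop);
            if (ms < bestMs) { bestMs = ms; best = b; }
        }
        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        return best;
    }

Because the candidate ranges depend on the GPU model, such a search would be repeated for each of the four GPUs evaluated in the paper.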
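
For the multi-GPU case, "overlapping computation with communication" means launching the interior update while halo planes travel between devices. The sketch below assumes two peer-connected GPUs and hypothetical stencilInterior/stencilBoundary kernels and device buffers; it illustrates the pattern, not the paper's generated host code.

    #include <cuda_runtime.h>

    // Placeholder kernels standing in for the framework's generated code.
    __global__ void stencilInterior(const float* in, float* out, int nx, int ny, int nz);
    __global__ void stencilBoundary(const float* in, float* out, const float* halo, int nx, int ny);

    // One time step on GPU 0: compute the interior on one stream while the
    // boundary planes (nx*ny values each) are exchanged with GPU 1 on another.
    void stepWithOverlap(float* d_in0, float* d_out0, float* d_top0, float* d_halo0,
                         float* d_bottom1, float* d_halo1, int nx, int ny, int nz,
                         dim3 gridInner, dim3 gridEdge, dim3 block) {
        cudaSetDevice(0);
        cudaStream_t compute, comm;
        cudaStreamCreate(&compute);
        cudaStreamCreate(&comm);

        // Interior points need no remote data, so they start immediately.
        stencilInterior<<<gridInner, block, 0, compute>>>(d_in0, d_out0, nx, ny, nz);

        // Meanwhile, trade halo planes peer-to-peer (peer access enabled
        // beforehand with cudaDeviceEnablePeerAccess).
        size_t plane = (size_t)nx * ny * sizeof(float);
        cudaMemcpyPeerAsync(d_halo1, 1, d_top0, 0, plane, comm);    // GPU0 -> GPU1
        cudaMemcpyPeerAsync(d_halo0, 0, d_bottom1, 1, plane, comm); // GPU1 -> GPU0

        // The boundary slab waits for the incoming halo, then runs.
        cudaStreamSynchronize(comm);
        stencilBoundary<<<gridEdge, block, 0, compute>>>(d_in0, d_out0, d_halo0, nx, ny);
        cudaStreamSynchronize(compute);
        cudaStreamDestroy(compute);
        cudaStreamDestroy(comm);
    }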
