Branch optimization based on domestic GPGPU non-consistent control flow

doi:10.19678/j.issn.1000-3428.0070570

Abstract

General-purpose graphics processing units (GPGPUs) are widely used for a broad range of computational tasks because of their strong parallel processing capability. However, on GPGPUs that adopt the single instruction, multiple threads (SIMT) execution model, kernels exhibit divergent control flow at run time, which causes warp divergence and lowers overall accelerator performance. To address this performance loss, this paper proposes a branch compilation optimization for a specific scenario: consecutive branch merging (MergeCFG). In the compiler's intermediate code optimization phase, MergeCFG first applies control flow analysis to identify consecutive branch structures in the control flow graph that contain identical conditional jumps, which mark potential optimization opportunities. It then uses instruction analysis to assess whether the optimization is feasible, that is, whether the number of branch jumps can actually be reduced. Finally, it applies basic block duplication and merging to restructure the control flow, reducing the number of branch jumps, simplifying the control flow, and improving execution efficiency. To validate the method, experiments were conducted on a domestic GPGPU using seven suitable benchmark suites. The results show that the method effectively reduces the number of branch jumps and that the optimized test cases achieve significant performance gains: the average speedup of the tested cases improved by 2% to 12%, and individual test cases improved by more than a factor of five.
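The idea behind consecutive branch merging can be illustrated with a minimal Python sketch on a toy statement list rather than on compiler intermediate code. The `Assign` and `If` classes and the `merge_consecutive_branches` and `count_branches` helpers are hypothetical names introduced only for this illustration. This is not the authors' implementation, which works on the control flow graph and relies on basic block duplication and merging. The sketch assumes the simplest case: two adjacent branches test the same condition variable, and the first branch body does not write that variable.

```python
from dataclasses import dataclass, field


@dataclass
class Assign:
    target: str  # variable written by the statement
    expr: str    # right-hand side, kept as opaque text


@dataclass
class If:
    cond: str                                  # boolean variable tested
    then: list = field(default_factory=list)   # statements run when cond is true


def writes(stmts, var):
    """Return True if any statement (recursively) assigns to var."""
    for s in stmts:
        if isinstance(s, Assign) and s.target == var:
            return True
        if isinstance(s, If) and writes(s.then, var):
            return True
    return False


def count_branches(stmts):
    """Count If nodes, including nested ones."""
    return sum(1 + count_branches(s.then) for s in stmts if isinstance(s, If))


def merge_consecutive_branches(stmts):
    """Merge adjacent top-level If statements that test the same condition.

    Two branches are merged only when the first body leaves the condition
    variable untouched, so the second test would give the same result.
    The input list is not modified.
    """
    out = []
    for s in stmts:
        prev = out[-1] if out else None
        if (isinstance(s, If) and isinstance(prev, If)
                and prev.cond == s.cond
                and not writes(prev.then, s.cond)):
            # Same condition, still valid: fold both bodies under one test.
            out[-1] = If(prev.cond, prev.then + s.then)
        else:
            out.append(s)
    return out


program = [
    If("c", [Assign("x", "a + 1")]),
    If("c", [Assign("y", "x * 2")]),   # same test, c unchanged: mergeable
    Assign("z", "y"),
    If("c", [Assign("c", "0")]),       # this body overwrites c
    If("c", [Assign("w", "1")]),       # so this test must stay separate
]
merged = merge_consecutive_branches(program)
print(count_branches(program), "->", count_branches(merged))
```

In this toy example the first two branches collapse into one, while the last two stay separate because the condition variable is overwritten between the tests. That check plays the role of the feasibility analysis described in the abstract, in a much simplified form.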

Yipeng Wu, Zhikun Huo, Mengzhi Han. Branch optimization based on domestic GPGPU non-consistent control flow[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0070570.

