
Computer Engineering (计算机工程)


Branch optimization based on domestic GPGPU non-consistent control flow

  • Online: 2025-04-09 Published: 2025-04-09

Abstract: General-purpose graphics processing units (GPGPUs) are widely used for computational tasks because of their strong parallel processing capability. However, on GPGPUs that use the single instruction, multiple threads (SIMT) execution model, kernels can exhibit non-consistent (divergent) control flow at run time, which causes warp divergence and lowers overall accelerator performance. To address this performance loss, this paper proposes a branch compilation optimization for a specific scenario: consecutive branch merging (MergeCFG). In the compiler's intermediate code optimization phase, MergeCFG first uses control flow analysis to identify consecutive branch structures in the control flow graph whose conditional jumps test the same condition, which marks potential optimization opportunities. It then uses instruction analysis to assess feasibility, that is, whether the branch jumps can actually be reduced. Finally, it applies basic block duplication and merging to restructure the control flow, which reduces the number of branch jumps, simplifies the control flow, and improves execution efficiency. To validate the method, experiments were run on a domestic GPGPU using seven suitable benchmark suites. The results show that the method reduces the branch jumps in the programs and improves the performance of the optimized test cases: the average speedup across the evaluated cases improved by 2% to 12%, and individual test cases improved by more than fivefold.
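
The three steps named in the abstract (find consecutive branches on the same condition, check feasibility, then duplicate and merge basic blocks) can be illustrated with a minimal Python sketch on a toy control flow graph. This is not the paper's implementation: the `Block` class, the helper names, the `(written_variable, text)` instruction encoding, and the specific feasibility checks are all assumptions made for illustration, and a real compiler pass would work on its own intermediate representation with a far more careful analysis.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Block:
    """Toy basic block (hypothetical): straight-line instructions plus one terminator."""
    name: str
    # Each instruction is (written_variable, text); only the written variable is analysed.
    instrs: List[Tuple[str, str]] = field(default_factory=list)
    cond: Optional[str] = None   # variable tested by a conditional terminator, if any
    succs: Tuple[str, ...] = ()  # (true, false), (target,), or () for an exit block


def preds(cfg: Dict[str, Block], name: str) -> List[str]:
    """Names of all blocks that jump to `name`."""
    return sorted(b.name for b in cfg.values() if name in b.succs)


def count_cond_branches(cfg: Dict[str, Block]) -> int:
    return sum(1 for b in cfg.values() if b.cond is not None)


def merge_consecutive_branches(cfg: Dict[str, Block], head_name: str) -> bool:
    """Try to remove a second branch that re-tests the condition of `head_name`.

    Pattern: head --cond--> {t_arm, f_arm} --> join --same cond--> {t2, f2}.
    Returns True if the graph was rewritten.
    """
    head = cfg[head_name]
    # Step 1 (control flow analysis): find two consecutive branches on one condition.
    if head.cond is None or len(head.succs) != 2 or head.succs[0] == head.succs[1]:
        return False
    t_arm, f_arm = cfg[head.succs[0]], cfg[head.succs[1]]
    for arm in (t_arm, f_arm):
        if arm.cond is not None or len(arm.succs) != 1:
            return False
        if preds(cfg, arm.name) != [head_name]:
            return False  # another path could reach the arm with a different outcome
    if t_arm.succs != f_arm.succs:
        return False
    join = cfg[t_arm.succs[0]]
    if join.cond != head.cond or len(join.succs) != 2:
        return False
    if preds(cfg, join.name) != sorted([t_arm.name, f_arm.name]):
        return False
    # Step 2 (instruction analysis): the condition must not be rewritten in between.
    for blk in (t_arm, f_arm, join):
        if any(dest == head.cond for dest, _ in blk.instrs):
            return False
    # Step 3 (duplicate and merge): copy the join body into each arm and jump
    # straight to the successor that the already-known condition selects.
    t_arm.instrs = t_arm.instrs + list(join.instrs)
    f_arm.instrs = f_arm.instrs + list(join.instrs)
    t_arm.succs = (join.succs[0],)
    f_arm.succs = (join.succs[1],)
    del cfg[join.name]
    return True


def example_cfg() -> Dict[str, Block]:
    """Hypothetical kernel fragment with two back-to-back tests of `c`."""
    return {
        "entry": Block("entry", [("c", "c = tid < n")], cond="c", succs=("a", "b")),
        "a": Block("a", [("x", "x = 1")], succs=("join",)),
        "b": Block("b", [("x", "x = 2")], succs=("join",)),
        "join": Block("join", [("y", "y = x + 1")], cond="c", succs=("d", "e")),
        "d": Block("d", [("z", "z = y * 2")], succs=("exit",)),
        "e": Block("e", [("z", "z = y * 3")], succs=("exit",)),
        "exit": Block("exit"),
    }


if __name__ == "__main__":
    cfg = example_cfg()
    print(count_cond_branches(cfg))                   # 2
    print(merge_consecutive_branches(cfg, "entry"))   # True
    print(count_cond_branches(cfg))                   # 1
```

In this toy example the duplicated `join` body costs one extra copy of `y = x + 1`, while each thread executes one conditional jump instead of two. Whether that trade is worthwhile is the kind of question the feasibility step has to answer, and the sketch does not model it.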