
Computer Engineering ›› 2024, Vol. 50 ›› Issue (4): 321-331. doi: 10.19678/j.issn.1000-3428.0067000

• Development Research and Engineering Application •

Design and Optimization of LLVM Compiler for Domestic High Performance Accelerator

Qiang SONG1,2,*, Junlong TANG1, Zhaoyun CHEN2, Yang SHI2, Qixuan TAN2, Ziyang XIAO2, Wanghui ZOU1

  1. School of Physical and Electronic Sciences, Changsha University of Science and Technology, Changsha 410114, Hunan, China
  2. School of Computer, National University of Defense Technology, Changsha 410073, Hunan, China
  • Received: 2023-02-22 Published: 2024-04-15 Released: 2023-08-25
  • Contact: Qiang SONG
  • Funding: National Natural Science Foundation of China (62002366); Open Fund of the Hunan Provincial Key Laboratory of Flexible Electronic Materials Genome Engineering (202015)

Abstract:

The high-performance accelerator independently developed by the National University of Defense Technology adopts an on-chip heterogeneous architecture that integrates a Central Processing Unit (CPU) with a General Purpose Digital Signal Processor (GPDSP). The GPDSP, a vectorized design that combines Very Long Instruction Word (VLIW) with Single Instruction Multiple Data (SIMD), is the acceleration core that supplies most of the peak performance. Mainstream compilers do not adequately support this accelerator in three respects: arranging dense data-computation instructions, statically assigning hardware execution units to instructions, and handling GPDSP-specific vector instructions. This study addresses these gaps on top of the Low Level Virtual Machine (LLVM) compilation framework. In the pre-register-allocation scheduling (pre-RA-sched) stage, the peak-register-pressure-aware method (PERP), the Ant Colony Optimization (ACO) algorithm, and the structural characteristics of the GPDSP are combined to optimize the cost model, yielding an instruction scheduling module that is register-pressure aware. In the post-register-allocation scheduling (post-RA-sched) stage, an instruction scheduling strategy that supports static functional unit allocation is proposed; a conflict detection mechanism guarantees that the allocation is correct, which provides the software basis for parallel instruction execution. In the backend, a rich and regular set of vector instruction interfaces is encapsulated to support the GPDSP vector instructions. The experimental results show that the proposed LLVM compilation architecture optimizations support the GPDSP well in both functionality and performance: the average overall speedup ratio is 4.539 on the GCC testsuite, 4.49 on the SPEC CPU 2017 floating-point tests, and 3.24 on the SPEC CPU 2017 integer tests, and vector programs that use the vector interfaces achieve an average performance improvement ratio of 97.1%.

Key words: General Purpose Digital Signal Processor(GPDSP), Low Level Virtual Machine(LLVM), compiler, instruction scheduling, vector instruction interface
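
The static functional unit allocation with conflict detection described in the abstract can be illustrated with a minimal C++ sketch. It is not the paper's implementation: the unit names (`ALU0`, `ALU1`, `MUL`, `LDST`), the `Instr` and `Slot` types, and the greedy in-order packing policy are all hypothetical, and the sketch assumes the instructions are mutually independent, whereas a real post-RA scheduler would also respect the dependence graph.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical functional units of an illustrative VLIW core (not the real GPDSP).
enum Unit : unsigned { ALU0 = 1u << 0, ALU1 = 1u << 1, MUL = 1u << 2, LDST = 1u << 3 };
constexpr unsigned kAllUnits = ALU0 | ALU1 | MUL | LDST;

struct Instr {
    std::string name;
    unsigned allowed;  // bitmask of units able to execute this instruction
};

struct Slot {
    std::string name;
    unsigned unit;  // the single unit statically assigned to the instruction
    int cycle;      // index of the issue bundle the instruction was placed in
};

// Greedy in-order packing: give each instruction the lowest-numbered free unit
// it may use; when every allowed unit is busy, open a new bundle (next cycle).
// Assumption: instructions are independent, so only unit conflicts matter here.
std::vector<Slot> assignUnits(const std::vector<Instr>& instrs) {
    std::vector<Slot> out;
    unsigned busy = 0;  // units already taken in the current bundle
    int cycle = 0;
    for (const Instr& in : instrs) {
        const unsigned usable = in.allowed & kAllUnits;
        if (usable == 0) throw std::invalid_argument("no unit can run " + in.name);
        unsigned freeUnits = usable & ~busy;
        if (freeUnits == 0) {  // conflict: all allowed units are occupied
            ++cycle;
            busy = 0;
            freeUnits = usable;
        }
        const unsigned unit = freeUnits & (~freeUnits + 1u);  // lowest set bit
        busy |= unit;
        out.push_back({in.name, unit, cycle});
    }
    return out;
}

// Independent check of the result: true if two instructions in the same
// cycle were assigned the same functional unit.
bool hasConflict(const std::vector<Slot>& slots) {
    for (std::size_t i = 0; i < slots.size(); ++i)
        for (std::size_t j = i + 1; j < slots.size(); ++j)
            if (slots[i].cycle == slots[j].cycle && slots[i].unit == slots[j].unit)
                return true;
    return false;
}
```

Calling `assignUnits` on a list of instructions returns one `Slot` per instruction, and `hasConflict` can then be run on the result as a separate correctness check. The greedy policy is chosen only for brevity; it does not necessarily minimize the number of bundles, for which a matching-based assignment would be needed.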