Design and Optimization of LLVM Compiler for Domestic High Performance Accelerator

doi:10.19678/j.issn.1000-3428.0067000

Abstract

The high-performance accelerator independently developed by the National University of Defense Technology uses an on-chip heterogeneous fusion architecture that combines a Central Processing Unit (CPU) with a General Purpose Digital Signal Processor (GPDSP). The GPDSP, a vectorized design built on Very Long Instruction Word (VLIW) and Single Instruction Multiple Data (SIMD) execution, is the acceleration core that supplies most of the peak performance. Mainstream compilers do not support this accelerator well in three respects: scheduling dense data-computation instructions, statically assigning hardware execution units to instructions, and handling GPDSP-specific vector instructions. This study builds on the Low Level Virtual Machine (LLVM) compilation framework. In the pre-register-allocation scheduling (pre-RA-sched) stage, it combines the peak-register-pressure-aware method (PERP), the Ant Colony Optimization (ACO) algorithm, and the structural characteristics of the GPDSP to optimize the cost model, and designs an instruction scheduling module that is register-pressure aware. In the post-register-allocation scheduling (post-RA-sched) stage, it proposes an instruction scheduling strategy that supports static functional unit allocation; a conflict detection mechanism guarantees that the allocation is correct and provides the software basis for parallel instruction execution. In the backend, a rich and regular set of vector instruction interfaces is encapsulated to support the GPDSP vector instructions. Experimental results show that the proposed LLVM optimizations support the GPDSP well in both functionality and performance: the average overall speedup is 4.539 on the GCC testsuite, 4.49 on the SPEC CPU 2017 floating-point tests, and 3.24 on the SPEC CPU 2017 integer tests, and vector programs that use the vector interfaces achieve an average performance improvement rate of 97.1%.

Key words: General Purpose Digital Signal Processor (GPDSP), Low Level Virtual Machine (LLVM), compiler, instruction scheduling, vector instruction interface
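As an illustration of why instruction order matters for register pressure, the following minimal sketch computes the peak number of simultaneously live values for a linear instruction order. It is not the paper's PERP cost model or its ACO search, neither of which is specified on this page; the `Instr` class, the register names, and the two example orders are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: a name plus the virtual registers it writes and reads."""
    name: str
    defs: tuple = ()
    uses: tuple = ()


def peak_register_pressure(schedule, live_out=()):
    """Return the largest number of values live at any point of a linear schedule."""
    live = set(live_out)          # values still needed after the last instruction
    peak = len(live)
    for ins in reversed(schedule):
        # Just after `ins`, its results occupy registers even if nothing reads them.
        peak = max(peak, len(live | set(ins.defs)))
        # Just before `ins`, its results do not exist yet but its operands must be live.
        live = (live - set(ins.defs)) | set(ins.uses)
        peak = max(peak, len(live))
    return peak


# r = (a + b) + (c + d), with a..d loaded from memory.
loads = {r: Instr(f"load_{r}", defs=(r,)) for r in "abcd"}
add1 = Instr("add1", defs=("s1",), uses=("a", "b"))
add2 = Instr("add2", defs=("s2",), uses=("c", "d"))
add3 = Instr("add3", defs=("r",), uses=("s1", "s2"))

# Same instructions, two legal orders.
loads_first = [loads["a"], loads["b"], loads["c"], loads["d"], add1, add2, add3]
interleaved = [loads["a"], loads["b"], add1, loads["c"], loads["d"], add2, add3]

pressure_loads_first = peak_register_pressure(loads_first, live_out=("r",))
pressure_interleaved = peak_register_pressure(interleaved, live_out=("r",))
```

Issuing all four loads first keeps four values live at once, while interleaving the first addition lowers the peak to three. A register-pressure-aware scheduler weighs this kind of metric against schedule length when choosing an order.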

Qiang SONG, Junlong TANG, Zhaoyun CHEN, Yang SHI, Qixuan TAN, Ziyang XIAO, Wanghui ZOU. Design and Optimization of LLVM Compiler for Domestic High Performance Accelerator[J]. Computer Engineering, 2024, 50(4): 321-331.

http://www.ecice06.com/CN/Y2024/V50/I4/321

Figures/Tables 15

Fig.1 GPDSP structure

Fig.2 LLVM compiler architecture

Fig.3 Register-pressure-aware scheduling process

Fig.4 Conflict detection process

Fig.5 Design scheme of vector instruction interface

Fig.6 Overall performance test results of GCC testsuite

Fig.7 Overall performance test results of SPEC CPU 2017

Fig.8 Performance test results of register-pressure-aware scheduling

Fig.9 Test results of vector program runtime
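The abstract states that a conflict detection mechanism guarantees correct static functional unit allocation. The flow of Fig.4 is not reproduced on this page, so the sketch below only illustrates the general idea under stated assumptions: within one VLIW issue packet, each instruction must be bound to a unit that can execute its operation class, and no unit may be bound twice. The unit names (`MAC0`, `MAC1`, `ALU0`, `LS0`), the operation classes, and the function `find_conflicts` are hypothetical and are not taken from the GPDSP or from LLVM.

```python
# Hypothetical functional units and the operation classes each one can execute.
UNIT_OPS = {
    "MAC0": {"fmul", "fadd"},
    "MAC1": {"fmul", "fadd"},
    "ALU0": {"iadd", "logic"},
    "LS0": {"load", "store"},
}


def find_conflicts(packet, unit_ops=UNIT_OPS):
    """Check one issue packet of (instr, op_class, unit) bindings.

    Returns a list of (instr, unit, reason) tuples; an empty list means the
    static unit allocation for this packet is conflict free.
    """
    conflicts = []
    owner = {}  # unit -> instruction already bound to it in this packet
    for instr, op_class, unit in packet:
        if op_class not in unit_ops.get(unit, set()):
            conflicts.append((instr, unit, "unsupported"))
            continue
        if unit in owner:
            conflicts.append((instr, unit, "busy"))
        else:
            owner[unit] = instr
    return conflicts


# A packet whose four instructions each get a distinct, capable unit.
good_packet = [
    ("i0", "fmul", "MAC0"),
    ("i1", "fadd", "MAC1"),
    ("i2", "iadd", "ALU0"),
    ("i3", "load", "LS0"),
]

# Two multiplies bound to the same unit, and a load bound to an arithmetic unit.
bad_packet = [
    ("i0", "fmul", "MAC0"),
    ("i1", "fmul", "MAC0"),
    ("i2", "load", "ALU0"),
]

good_result = find_conflicts(good_packet)
bad_result = find_conflicts(bad_packet)
```

A scheduler could respond to a reported conflict by rebinding the instruction to another capable unit or by delaying it to a later packet; which policy the paper uses is not stated here.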
