
Computer Engineering (计算机工程), 2024, Vol. 50, Issue (8): 353-362. doi: 10.19678/j.issn.1000-3428.0068182

• Development Research and Engineering Application •


An Acceleration Strategy for Operator Generation Based on TVM

Wei GAO1, Shuailong LI2, Lin MAO3, Lei WANG2, Yingying LI4, Lin HAN1,*

  1. National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, Henan, China
    2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
    3. 19th Squadron, 92196 Troop, Qingdao 266000, Shandong, China
    4. School of Cyberspace Security, Information Engineering University, Zhengzhou 450001, Henan, China
  • Received: 2023-08-04 Online: 2024-08-15 Published: 2024-08-09
  • Contact: Lin HAN
  • Supported by: Major Science and Technology Project of Henan Province (221100210600)


Abstract:

With the rapid development of Artificial Intelligence (AI), new operators and underlying hardware keep emerging, which greatly increases the workload of developing and maintaining operator libraries. Relying solely on manual optimization to improve the performance and efficiency of AI models quickly runs into bottlenecks. The TVM deep learning compiler alleviates the burden of manual optimization through automated code generation; however, it suffers from long search times. To address this issue, this study proposes two optimization strategies for Ansor, the automated code generation framework of TVM: a new cost model based on a gradient boosting algorithm, and a scheduling-space pruning strategy based on predefined rules. Both strategies aim to accelerate the automated code generation process of TVM, enabling rapid deployment of models and providing more efficient solutions for the application of AI technology. The experimental results show that with the optimized cost model, the tuning time of models on the x86 CPU platform is reduced by 30% to 35% with no loss in inference time, while the performance of the optimized operators improves by up to 22%; on the Deep Computing Unit (DCU) platform, the tuning time is reduced by approximately 20%, and the average performance of the optimized operators improves by 5.7%. In addition, the pruning strategy based on predefined rules effectively improves the convergence speed of the cost model, and under the original optimal number of iterations, the inference time of the model improves by 7.4%.

Key words: deep learning compiler, cost model, gradient boosting algorithm, pruning strategy, automatic tuning
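
For readers unfamiliar with the tuning flow the abstract refers to, the sketch below (not taken from the paper) shows how a single operator is auto-tuned with TVM's Ansor framework through the tvm.auto_scheduler API, and where a learned cost model plugs into the search. Ansor's stock XGBModel used here is itself a gradient-boosting cost model; the paper's strategies would replace this model with its own and prune the schedule space explored by the SketchPolicy using predefined rules. The matmul workload, trial count, and log file name are illustrative assumptions.

import tvm
from tvm import auto_scheduler, te

@auto_scheduler.register_workload
def matmul(N, M, K):
    # A simple dense workload used as the tuning target (illustrative only).
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")  # x86 CPU; a DCU/GPU run would use a different target
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024), target=target)

# Ansor's built-in learned cost model (XGBoost-based). The paper's gradient-boosting
# cost model would be swapped in here, and its predefined rules would shrink the
# schedule space that the SketchPolicy below explores.
cost_model = auto_scheduler.XGBModel()
policy = auto_scheduler.SketchPolicy(task, program_cost_model=cost_model)

log_file = "matmul_tuning.json"  # hypothetical log file name
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # number of candidate schedules actually compiled and measured
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option, search_policy=policy)

# Apply the best schedule found during the search and build the operator.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)

In this flow, the search time that the abstract targets is dominated by the num_measure_trials candidate schedules that must be predicted, compiled, and measured on hardware, which is why a more accurate cost model and a smaller schedule space translate directly into shorter tuning time.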