
Computer Engineering (计算机工程), 2024, Vol. 50, Issue (8): 353-362. doi: 10.19678/j.issn.1000-3428.0068182

• Development Research and Engineering Application •


An Acceleration Strategy for Operator Generation Based on TVM

Wei GAO1, Shuailong LI2, Lin MAO3, Lei WANG2, Yingying LI4, Lin HAN1,*

  1. National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, Henan, China
    2. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
    3. 19th Squadron, 92196 Troop, Qingdao 266000, Shandong, China
    4. School of Cyberspace Security, Information Engineering University, Zhengzhou 450001, Henan, China
  • Received: 2023-08-04 Online: 2024-08-15 Published: 2024-08-09
  • Contact: Lin HAN
  • Supported by: Major Science and Technology Project of Henan Province (221100210600)


Abstract:

With the rapid development of Artificial Intelligence (AI), new operators and underlying hardware keep emerging, which greatly increases the workload of developing and maintaining operator libraries. Relying solely on manual optimization to improve the performance and efficiency of AI models quickly runs into bottlenecks. The TVM deep learning compiler alleviates the burden of manual optimization through automated code generation; however, it suffers from long search times. To address this issue, this study proposes two optimization strategies for Ansor, the automated code generation framework of TVM: a new cost model based on a gradient boosting algorithm, and a scheduling-space pruning strategy based on predefined rules. Both strategies aim to accelerate the automated code generation process of TVM, enabling rapid deployment of models and providing more efficient solutions for the application of AI technology. The experimental results show that with the optimized cost model, the tuning time of models on the x86 CPU platform is reduced by 30% to 35% with no loss in inference time, while the performance of the optimized operators improves by up to 22%; on the Deep Computing Unit (DCU) platform, the tuning time is reduced by approximately 20%, and the average performance of the optimized operators improves by 5.7%. In addition, the pruning strategy based on predefined rules effectively improves the convergence speed of the cost model, and under the original optimal number of iterations, the inference time of the model improves by 7.4%.

Key words: deep learning compiler, cost model, gradient boosting algorithm, pruning strategy, automatic tuning
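
For readers unfamiliar with the tuning flow the abstract refers to, the sketch below (not taken from the paper) shows how a single operator is auto-tuned with TVM's Ansor framework through the tvm.auto_scheduler API, and where a learned cost model plugs into the search. Ansor's stock XGBModel used here is itself a gradient-boosting cost model; the paper's strategies would replace this model with its own and prune the schedule space explored by the SketchPolicy using predefined rules. The matmul workload, trial count, and log file name are illustrative assumptions.

import tvm
from tvm import auto_scheduler, te

@auto_scheduler.register_workload
def matmul(N, M, K):
    # A simple dense workload used as the tuning target (illustrative only).
    A = te.placeholder((N, K), name="A")
    B = te.placeholder((K, M), name="B")
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")  # x86 CPU; a DCU/GPU run would use a different target
task = auto_scheduler.SearchTask(func=matmul, args=(1024, 1024, 1024), target=target)

# Ansor's built-in learned cost model (XGBoost-based). The paper's gradient-boosting
# cost model would be swapped in here, and its predefined rules would shrink the
# schedule space that the SketchPolicy below explores.
cost_model = auto_scheduler.XGBModel()
policy = auto_scheduler.SketchPolicy(task, program_cost_model=cost_model)

log_file = "matmul_tuning.json"  # hypothetical log file name
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # number of candidate schedules actually compiled and measured
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option, search_policy=policy)

# Apply the best schedule found during the search and build the operator.
sch, args = task.apply_best(log_file)
func = tvm.build(sch, args, target)

In this flow, the search time that the abstract targets is dominated by the num_measure_trials candidate schedules that must be predicted, compiled, and measured on hardware, which is why a more accurate cost model and a smaller schedule space translate directly into shorter tuning time.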