
Computer Engineering ›› 2025, Vol. 51 ›› Issue (5): 62-72. doi: 10.19678/j.issn.1000-3428.0069206

• Artificial Intelligence and Pattern Recognition •

Support and Optimization of Multi-Granularity Quantization Framework for Deep Learning Compiler

WEI Mingkang1, LI Jianan1, HAN Lin1,2, GAO Wei2, ZHAO Rongcai2, WANG Hongsheng2

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450000, Henan, China;
    2. National Supercomputing Center in Zhengzhou (Zhengzhou University), Zhengzhou 450000, Henan, China
  • Received: 2024-01-11  Revised: 2024-03-10  Online: 2025-05-15  Published: 2024-05-28
  • Corresponding author: GAO Wei, E-mail: yongwu22@126.com
  • Supported by: the Henan Provincial Major Science and Technology Project (221100210600).


Abstract: With the surging demand from major manufacturers for deploying large models, the single quantization method of the deep learning compiler Tensor Virtual Machine (TVM) suffers an accuracy drop and no longer satisfies deployment requirements. Therefore, this study designs and constructs a model quantization framework with selectable granularity. The framework supports both layer-wise and channel-wise quantization flows and implements threshold-search and adaptive-rounding optimization algorithms. First, based on the quantization module "relay.quantize", a framework flow covering information annotation, threshold calibration, and quantized-graph realization is constructed, and a granularity attribute is added to explicitly identify the quantization method. Second, to address the problem that predefined calibration methods cannot determine effective quantization information, the threshold calibration and weight rounding in quantization are tuned, improving the accuracy of the quantized model. Experiments test visual networks on the ImageNet dataset. For MobileNetV1, the new quantization scheme reduces the accuracy loss after 8-bit quantization to 2.3%, and tuning further reduces this loss to 0.7%. The results show that the multi-granularity quantization framework effectively reduces quantization error.
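The layer-wise versus channel-wise distinction at the heart of the abstract can be illustrated with a small NumPy sketch. This is not the authors' TVM implementation, only a minimal stand-alone illustration: per-layer (per-tensor) quantization uses one scale for the whole weight tensor, while per-channel quantization computes one scale per output channel, which reduces error when channel ranges differ widely.

```python
import numpy as np

def quantize_per_layer(w, num_bits=8):
    """Symmetric quantization with a single scale for the whole tensor."""
    qmax = 2 ** (num_bits - 1) - 1            # 127 for 8 bits
    scale = np.abs(w).max() / qmax
    q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return q, scale

def quantize_per_channel(w, num_bits=8):
    """Symmetric quantization with one scale per output channel (axis 0)."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).reshape(w.shape[0], -1).max(axis=1) / qmax   # shape (C_out,)
    q = np.clip(np.round(w / scale[:, None, None, None]), -qmax, qmax).astype(np.int8)
    return q, scale

# A conv weight whose output channels differ widely in range: the tiny
# channel is crushed by the per-layer scale but preserved per-channel.
np.random.seed(0)
w = np.random.randn(4, 3, 3, 3).astype(np.float32)
w[0] *= 0.01                                  # one channel with a tiny range

q_l, s_l = quantize_per_layer(w)
q_c, s_c = quantize_per_channel(w)
err_layer = np.abs(q_l * s_l - w).mean()
err_channel = np.abs(q_c * s_c[:, None, None, None] - w).mean()
# err_channel is noticeably smaller than err_layer for this tensor.
```

Adding a granularity attribute, as the framework does, amounts to selecting between these two scale computations per operator.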

Key words: model quantization, model deployment, model compression, inference acceleration, deep learning compiler
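The threshold-search calibration mentioned in the abstract can likewise be sketched in simplified form. The snippet below is a hypothetical illustration, not the paper's algorithm: it grid-searches a clipping threshold that minimizes quantization mean-squared error, trading outlier clipping against a finer scale for inliers (production calibrators often minimize KL divergence instead).

```python
import numpy as np

def search_threshold(x, num_bits=8, num_steps=100):
    """Grid-search a clipping threshold minimizing quantization MSE.

    Returns the best threshold and its reconstruction error.
    """
    qmax = 2 ** (num_bits - 1) - 1
    max_val = float(np.abs(x).max())
    best_t, best_err = max_val, np.inf
    for i in range(1, num_steps + 1):
        t = max_val * (i / num_steps)         # candidate clipping threshold
        scale = t / qmax
        q = np.clip(np.round(x / scale), -qmax, qmax) * scale
        err = float(np.mean((q - x) ** 2))
        if err < best_err:
            best_t, best_err = t, err
    return best_t, best_err

np.random.seed(0)
x = np.random.randn(10000).astype(np.float32)  # activations with an outlier tail

threshold, err = search_threshold(x)

# Baseline: no clipping, scale taken directly from the absolute maximum.
naive_scale = float(np.abs(x).max()) / 127
naive_err = float(np.mean((np.round(x / naive_scale) * naive_scale - x) ** 2))
# The searched threshold never does worse than the naive full-range scale,
# since the full range is itself one of the candidates.
```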
