
Computer Engineering

   

Convolutional Operator Code Generator for a Domestic Accelerator

  

Published: 2025-04-10


Abstract: In recent years, a domestic deep learning accelerator has evolved rapidly: its hardware resources change continuously, and a series of tensor-core instructions has been introduced, which makes manually adapting and optimizing convolution operators for the accelerator a major challenge for developers. To address this, this paper proposes a convolution code generator for the domestic accelerator that simplifies the adaptation and optimization of convolution operators. The generator exposes configuration parameters as its external interface; users only need to set these parameters to generate a specific convolution operator. The generator itself has a three-layer architecture: the instruction layer encapsulates the underlying instructions and distinguishes them by hardware architecture; the component layer organizes the corresponding instructions according to preset hardware-architecture information and provides highly abstract, reusable functional components at the thread-block and warp level; and the operator construction layer assembles these functional components according to the implicit convolution algorithm to generate the final convolution operator. To ensure the operator's computational performance, the generator is optimized in two respects: a vectorization algorithm and a thread-partitioning algorithm optimize global memory access, and a transposition algorithm transforms the thread-level layout of the multiply-accumulate results to optimize write-back performance. Test results show that the generator's optimization algorithms significantly improve operator performance; on two hardware versions, convolution operators with the NHWC storage layout reach 95% and 90% of the official operators' performance, respectively. The generator thus provides a new solution for adapting and optimizing convolution operators on domestic accelerators.
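The implicit convolution algorithm mentioned above avoids materializing an explicit im2col buffer: each element of the equivalent GEMM is fetched on the fly by mapping GEMM row/column indices back to NHWC tensor coordinates. The following sketch is illustrative NumPy, not the generator's emitted accelerator code; the function name and the stride-1, no-padding restriction are assumptions made for brevity. It shows the index arithmetic an operator construction layer of this kind relies on:

```python
import numpy as np

def conv2d_implicit_gemm(x, w):
    """Implicit-GEMM convolution (stride 1, no padding, NHWC layout).

    x: (N, H, W, C) input; w: (R, S, C, K) filter -> (N, P, Q, K) output.
    Rather than building an im2col matrix, each GEMM element is read
    directly from x by decoding the GEMM indices into tensor indices.
    """
    N, H, W, C = x.shape
    R, S, C2, K = w.shape
    assert C == C2, "channel dimensions must match"
    P, Q = H - R + 1, W - S + 1

    M = N * P * Q        # GEMM rows: one per output pixel
    Kdim = R * S * C     # GEMM reduction dimension
    out = np.zeros((M, K), dtype=x.dtype)
    for m in range(M):
        # decode GEMM row index -> (batch, output row, output column)
        n, p, q = m // (P * Q), (m // Q) % P, m % Q
        for kk in range(Kdim):
            # decode reduction index -> (filter row, filter col, channel)
            r, s, c = kk // (S * C), (kk // C) % S, kk % C
            for k in range(K):
                # "implicit im2col": read the input element in place
                out[m, k] += x[n, p + r, q + s, c] * w[r, s, c, k]
    return out.reshape(N, P, Q, K)
```

A real kernel would tile the GEMM dimensions across thread blocks and warps and feed these per-element loads into tensor-core multiply-accumulate instructions; the sketch only demonstrates the index mapping, which can be verified against a direct convolution.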
