
Computer Engineering

   

Convolutional Operator Code Generator for a Domestic Accelerator

  

Published: 2025-04-10


Abstract: In recent years, a domestic deep learning accelerator has evolved rapidly: its hardware resources change continuously, and a series of tensor-core instructions has been introduced, which makes manually adapting and optimizing convolution operators for the accelerator a major challenge for developers. To address this, this paper proposes a convolution code generator for the domestic accelerator that simplifies the adaptation and optimization of convolution operators. The generator exposes configuration parameters as its external interface; users only need to set these parameters to generate a specific convolution operator. The generator itself has a three-layer architecture: the instruction layer encapsulates the underlying instructions and distinguishes them by hardware architecture; the component layer organizes the corresponding instructions according to preset hardware-architecture information and provides highly abstract, reusable functional components at the thread-block and warp level; and the operator construction layer assembles these functional components according to the implicit convolution algorithm to generate the final convolution operator. To ensure the operator's computational performance, the generator is optimized in two respects: a vectorization algorithm and a thread-partitioning algorithm optimize global memory access, and a transposition algorithm transforms the thread-level layout of the multiply-accumulate results to optimize write-back performance. Test results show that the generator's optimization algorithms significantly improve operator performance; on two hardware versions, convolution operators with the NHWC storage layout reach 95% and 90% of the official operators' performance, respectively. The generator thus provides a new solution for adapting and optimizing convolution operators on domestic accelerators.
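The implicit convolution algorithm mentioned above avoids materializing an explicit im2col buffer: each element of the equivalent GEMM is fetched on the fly by mapping GEMM row/column indices back to NHWC tensor coordinates. The following sketch is illustrative NumPy, not the generator's emitted accelerator code; the function name and the stride-1, no-padding restriction are assumptions made for brevity. It shows the index arithmetic an operator construction layer of this kind relies on:

```python
import numpy as np

def conv2d_implicit_gemm(x, w):
    """Implicit-GEMM convolution (stride 1, no padding, NHWC layout).

    x: (N, H, W, C) input; w: (R, S, C, K) filter -> (N, P, Q, K) output.
    Rather than building an im2col matrix, each GEMM element is read
    directly from x by decoding the GEMM indices into tensor indices.
    """
    N, H, W, C = x.shape
    R, S, C2, K = w.shape
    assert C == C2, "channel dimensions must match"
    P, Q = H - R + 1, W - S + 1

    M = N * P * Q        # GEMM rows: one per output pixel
    Kdim = R * S * C     # GEMM reduction dimension
    out = np.zeros((M, K), dtype=x.dtype)
    for m in range(M):
        # decode GEMM row index -> (batch, output row, output column)
        n, p, q = m // (P * Q), (m // Q) % P, m % Q
        for kk in range(Kdim):
            # decode reduction index -> (filter row, filter col, channel)
            r, s, c = kk // (S * C), (kk // C) % S, kk % C
            for k in range(K):
                # "implicit im2col": read the input element in place
                out[m, k] += x[n, p + r, q + s, c] * w[r, s, c, k]
    return out.reshape(N, P, Q, K)
```

A real kernel would tile the GEMM dimensions across thread blocks and warps and feed these per-element loads into tensor-core multiply-accumulate instructions; the sketch only demonstrates the index mapping, which can be verified against a direct convolution.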
