
Computer Engineering

   

Porting and Optimization of a Sparse Matrix Template Library for Domestic Accelerators

  

  • Published: 2025-10-17


Abstract: The CUDA sparse matrix template library (CUTLASS-Sparse), part of the CUDA linear algebra template library (CUTLASS), is used to build customizable, high-performance sparse matrix-dense matrix multiplication (SpMM) kernels, which play an important role in fields such as scientific computing and deep learning. However, it is implemented and optimized only for NVIDIA GPUs and cannot run on domestic accelerators. To address this problem, a porting and optimization scheme that targets CUTLASS-Sparse at domestic accelerators is proposed. In the porting stage, the data access, data computation, and data write-back modules are adapted to the hardware architecture of the domestic accelerators. In the optimization stage, to address a high conflict rate on shared memory physical storage units (banks), low shared memory bandwidth utilization, low data pipeline parallelism, and low data write-back efficiency, three methods are proposed: two shared memory data reordering algorithms, a data pipeline strategy based on data prefetching and register double buffering, and a data write-back strategy based on data aggregation. Experimental results show that all three optimizations significantly improve the performance of the ported CUTLASS-Sparse. For the TF32 and FP16 data types, the optimized CUTLASS-Sparse outperforms the unoptimized version by an average of 30% and 115%, respectively, reaching on average 76% and 60% of the performance of CUTLASS-Sparse on an NVIDIA L20 GPU. Across two hardware versions, the ported and optimized CUTLASS-Sparse achieves on average 2.36 times and 3.09 times the performance of the SPARSE math library on the domestic accelerator platform, verifying the effectiveness of the proposed porting and optimization scheme.
