
Computer Engineering

   

Porting and Optimization of a Sparse Matrix Template Library for Domestic Accelerators

  

  • Published: 2025-10-17


Abstract: The CUDA sparse matrix template library (CUTLASS-Sparse), part of the CUDA linear algebra template library (CUTLASS), is used to build customizable, high-performance sparse matrix-dense matrix multiplication (SpMM) kernels, which play an important role in fields such as scientific computing and deep learning. However, it is implemented and optimized only for NVIDIA GPUs and cannot run on domestic accelerators. To address this problem, a porting and optimization scheme that targets CUTLASS-Sparse at domestic accelerators is proposed. In the porting stage, the data access, data computation, and data write-back modules are adapted to the hardware architecture of the domestic accelerators. In the optimization stage, to address a high conflict rate on shared memory physical storage units (banks), low shared memory bandwidth utilization, low data pipeline parallelism, and low data write-back efficiency, three methods are proposed: two shared memory data reordering algorithms, a data pipeline strategy based on data prefetching and register double buffering, and a data write-back strategy based on data aggregation. Experimental results show that all three optimizations significantly improve the performance of the ported CUTLASS-Sparse. For the TF32 and FP16 data types, the optimized CUTLASS-Sparse outperforms the unoptimized version by an average of 30% and 115%, respectively, reaching on average 76% and 60% of the performance of CUTLASS-Sparse on an NVIDIA L20 GPU. Across two hardware versions, the ported and optimized CUTLASS-Sparse achieves on average 2.36 times and 3.09 times the performance of the SPARSE math library on the domestic accelerator platform, verifying the effectiveness of the proposed porting and optimization scheme.
