[1] FAN R, WANG W, CHU X W. Dtc-SpMM: bridging the gap in accelerating general sparse matrix multiplication with tensor cores[C]//Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 3. New York, USA: ACM Press, 2024: 253-267.
[2] SONG Y C, WANG Y B, XIONG C Y, et al. An efficient sampling-based SpMM kernel for balancing accuracy and speed in GNN inference[C]//Proceedings of the 2024 IEEE International Symposium on Parallel and Distributed Processing with Applications. Washington D. C., USA: IEEE Press, 2024: 468-475.
[3] WU S W, SUN F, ZHANG W T, et al. Graph neural networks in recommender systems: a survey[J]. ACM Computing Surveys, 2022, 55(5): 1-37.
[4] HOEFLER T, ALISTARH D, BEN-NUN T, et al. Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks[J]. Journal of Machine Learning Research, 2021, 22(241): 1-124.
[5] ANZT H, TOMOV S, DONGARRA J J. Accelerating the LOBPCG method on GPUs using a blocked sparse matrix vector product[C]//Proceedings of the Symposium on High Performance Computing. San Diego, USA: Society for Computer Simulation International, 2015: 75-82.
[6] ZHAO H S, LI S, WANG J H, et al. Acc-SpMM: accelerating general-purpose sparse matrix-matrix multiplication with GPU tensor cores[C]//Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming. New York, USA: ACM Press, 2025: 326-338.
[7] YANG C, BULUC A, OWENS J D. Design principles for HYB-based SpMV on the new-generation Sunway architecture[J]. Computer Engineering & Science, 2023, 45(10): 1754-1762.
[26] 姬晨晨, 陈永青, 韩孟之. 基于国产加速器的三维卷积前向算子优化[J]. 计算机工程, 2025, 51(2): 250-258.
JI C C, CHEN Y Q, HAN M Z. Optimization of 3D convolutional forward operators based on domestic accelerators[J]. Computer Engineering, 2025, 51(2): 250-258.
[27] 明刚, 张艳霞, 李旭胜, 等. 基于算子融合和向量化访存的大语言模型部署优化研究[C]//全国大模型与决策智能大会论文集. 中国杭州: 中国指挥与控制学会, 2024: 214-224.
MING G, ZHANG Y X, LI X S, et al. Research on optimization of large language model deployment based on operator fusion and vectorized memory access[C]//Proceedings of the China Conference on Large Foundation Model and Decision Intelligence. Hangzhou, China: Chinese Institute of Command and Control, 2024: 214-224.
[28] DAVIS T A, HU Y F. The University of Florida sparse matrix collection[J]. ACM Transactions on Mathematical Software, 2011, 38(1): 1-25.