[1] Zhao G, Sun N, Shen S, et al. GPU-Accelerated Target
Strength Prediction Based on Multiresolution Shooting
and Bouncing Ray Method[J]. Applied Sciences, 2022,
12(12): 6119.
[2] Golosio B, Villamar J, Tiddia G, et al. Runtime
Construction of Large-Scale Spiking Neuronal Network
Models on GPU Devices[J]. Applied Sciences, 2023,
13(17): 9598.
[3] Hu Y, Liu Y, Liu Z. A survey on convolutional neural
network accelerators: GPU, FPGA and ASIC[C]//2022
14th International Conference on Computer Research and
Development (ICCRD). IEEE, 2022: 100-107.
[4] Chen Y, Dai X, Liu M, et al. Dynamic convolution:
Attention over convolution kernels[C]//Proceedings of the
IEEE/CVF conference on computer vision and pattern
recognition. 2020: 11030-11039.
[5] Kwon H, Chatarasi P, Sarkar V, et al. Maestro: A
data-centric approach to understand reuse, performance,
and hardware cost of DNN mappings[J]. IEEE Micro, 2020,
40(3): 20-29.
[6] Xie X, Lin J, Wang Z, et al. An efficient and flexible
accelerator design for sparse convolutional neural
networks[J]. IEEE Transactions on Circuits and Systems I:
Regular Papers, 2021, 68(7): 2936-2949.
[7] Jia Z, Maggioni M, Staiger B, et al. Dissecting the
NVIDIA Volta GPU architecture via
microbenchmarking[J]. arXiv preprint arXiv:1804.06826, 2018.
[8] Alwan E H, Ketran R M, Hussein I A. A Comprehensive
Survey on Loop Unrolling Technique in Code
Optimization[J]. Journal of University of Babylon for
Pure and Applied Sciences, 2024: 108-117.
[9] Daghaghi S, Meisburger N, Zhao M, et al. Accelerating
slide deep learning on modern cpus: Vectorization,
quantizations, memory optimizations, and more[J].
Proceedings of Machine Learning and Systems, 2021, 3:
156-166.
[10] Pang Wenhao, Wang Jialun, Weng Chuliang. Survey on
Research Status of GPGPU and CUDA Unified
Memory[J]. Computer Engineering, 2024, 50(12): 1-15.
(in Chinese)
[11] Cao Yikui, Lu Zhonghua, Zhang Jian, et al. Parallel
Optimization of CFD Core Algorithms Based on Domestic
Processor[J]. Frontiers of Data and Computing, 2021,
3(4): 93-103. (in Chinese)
[12] Barca G M J. COMP4300/8300 Parallel Systems
Introduction to GPU Architecture & Programming[J].
2023.
[13] Sun W, Li A, Geng T, et al. Dissecting tensor cores via
microbenchmarks: Latency, throughput and numeric
behaviors[J]. IEEE Transactions on Parallel and
Distributed Systems, 2022, 34(1): 246-261.
[14] Sun W, Li A, Stuijk S, et al. How much can we gain from
Tensor Kernel Fusion on GPUs?[J]. IEEE Access, 2024.
[15] Chen T, Moreau T, Jiang Z, et al. TVM: An automated
end-to-end optimizing compiler for deep
learning[C]//13th USENIX Symposium on Operating
Systems Design and Implementation (OSDI 18). 2018:
578-594.
[16] Katel N, Khandelwal V, Bondhugula U. MLIR-based code
generation for GPU tensor cores[C]//Proceedings of the
31st ACM SIGPLAN International Conference on
Compiler Construction. 2022: 117-128.
[17] Tripathi M. Analysis of convolutional neural network
based image classification techniques[J]. Journal of
Innovative Image Processing (JIIP), 2021, 3(02): 100-117.
[18] Zhang Z, Zhang P, Xu Z, et al. Im2col-Winograd: An
Efficient and Flexible Fused-Winograd Convolution for
NHWC Format on GPUs[C]//Proceedings of the 53rd
International Conference on Parallel Processing. 2024:
1072-1081.
[19] Higham N J, Mary T. Mixed precision algorithms in
numerical linear algebra[J]. Acta Numerica, 2022, 31:
347-414.
[20] Xu R, Ma S, Guo Y. Performance analysis of different
convolution algorithms in GPU environment[C]//2018
IEEE International Conference on Networking,
Architecture and Storage (NAS). IEEE, 2018: 1-10.
[21] Shevgunov T, Efimov E, Guschina O. Estimation of a
Spectral Correlation Function Using a Time-Smoothing
Cyclic Periodogram and FFT Interpolation—2N-FFT
Algorithm[J]. Sensors, 2022, 23(1): 215.
[22] Gan T, Libo H. Review of Winograd fast convolution
technique research[J]. Journal of Frontiers of Computer
Science & Technology, 2022, 16(5): 959.
[23] Nakasato N. A fast GEMM implementation on the Cypress
GPU[J]. ACM SIGMETRICS Performance Evaluation
Review, 2011, 38(4): 50-55.
[24] Li Maowen, Qu Guoyuan, Wei Dazhou, et al.
Optimization of Convolutional Performance of Neural
Networks for GPU Computing Platforms[J]. Journal of
Computer Research and Development, 2022, 59(06):
1181-1191. (in Chinese)
[25] Korch M, Raithel P, Werner T. Implementation and
Optimization of a 1D2V PIC Method for Nonlinear
Kinetic Models on GPUs[C]//2020 28th Euromicro
International Conference on Parallel, Distributed and
Network-Based Processing (PDP). IEEE, 2020: 30-37.
[26] Zachariadis O, Satpute N, Gómez-Luna J, et al.
Accelerating sparse matrix–matrix multiplication with
GPU Tensor Cores[J]. Computers & Electrical
Engineering, 2020, 88: 106848.
[27] Markidis S, Der Chien S W, Laure E, et al. Nvidia tensor
core programmability, performance & precision[C]//2018
IEEE international parallel and distributed processing
symposium workshops (IPDPSW). IEEE, 2018: 522-531.
[28] Nematollahi N, Sadrosadati M, Falahati H, et al. Efficient
nearest-neighbor data sharing in GPUs[J]. ACM
Transactions on Architecture and Code Optimization
(TACO), 2020, 18(1): 1-26.
[29] Basso P M, dos Santos F F, Rech P. Impact of tensor cores
and mixed precision on the reliability of matrix
multiplication in GPUs[J]. IEEE Transactions on Nuclear
Science, 2020, 67(7): 1560-1565.
[30] Willemsen F J, Schoonhoven R, Filipovič J, et al. A
methodology for comparing optimization algorithms for
auto-tuning[J]. Future Generation Computer Systems,
2024.
[31] Liu Z, Li C, Tian X, et al. MVSim: A fast, scalable and
accurate architecture simulator for VLIW multi-core
vector processors[J]. Computer Engineering & Science,
2024, 46(02): 191.
[32] Ito Y, Nakano K. A GPU implementation of dynamic
programming for the optimal polygon triangulation[J].
IEICE Transactions on Information and Systems, 2013,
96(12): 2596-2603.