[1] 庞文豪,王嘉伦,翁楚良.GPGPU 和 CUDA 统一内存研究
现状综述[J/OL].计算机工程,1-22[2024-10-19].https://doi.
org/10.19678/j.issn.1000-3428.0068694.
PANG W H, WANG J L, WENG C L. Survey on G
PGPU and CUDA Unified Memory Research Status[J/
OL].Computer Engineering,1-22[2024-10-20].https://doi.
org/10.19678/j.issn.1000-3428.0068694.
[2] Multithreaded M , In P , Via C O R ,et al.NVIDIA
T ESLA : A U NIFIED G RAPHICS AND C OMPU
TING A RCHITECTURE COMPUTING ARCHITECT
URE. I TS SCALABLE PARALLEL ARRAY OF PR
OCESSORS IS[J]. 2008.
[3] Steffen M , Zambreno J .Improving SIMT Efficiency
of Global Rendering Algorithms with Architectural Su
pport for Dynamic Micro-Kernels[C]//IEEE/ACM Inter
national Symposium on Microarchitecture.ACM, 2010.
DOI:10.1109/MICRO.2010.45.
[4] Nugteren C , Braak G J V D , Corporaal H .Future
of GPGPU micro-architectural parameters[C]//Proceedin
gs of the Conference on Design, Automation and Test
in Europe.IEEE, 2013.DOI:10.7873/DATE.2013.089.
[5] Khorasani F , Gupta R , Bhuyan L N .Efficient warp
execution in presence of divergence with collaborative
context collection[C]//IEEE/ACM International Sympo
sium on Microarchitecture.IEEE, 2015:204-215.DOI:10.
1145/2830772.2830796.
[6] Minsoo,Rhu,Mattan,et al.CAPRI: Prediction of Compacti
on-Adequacy for Handling Control-Divergence in GPG
PU Architectures[J].Computer Architecture News, 201
2.
[7] 王旭昊,唐甜.一种源源编译控制流优化方法[J].航空计
算技术,2012,42(03):98-103.
WANG X H, TANG T. A Optimization Method of So
urce-to-Source Compiler Control Flow [J]. Aeronautica
l Computing Technique, 2012, 42(03): 98-103.
[8] Chen W K , Li B , Gupta R .Code Compaction of
Matching Single-Entry Multiple-Exit Regions[C]//Intern
ational symposium on static analysis.2003.
[9] Coutinho B , Sampaio D , Pereira F M Q ,et al.Dive
rgence Analysis and Optimizations[J].IEEE Computer
Society, 2011.DOI:10.1109/PACT.2011.63. [10] Saumya C , Sundararajah K , Kulkarni M .DARM:
Control-Flow Melding for SIMT Thread Divergence R
eduction -- Extended Version[J]. 2021.DOI:10.48550/ar
Xiv.2107.05681.
[11] Smith T F , Waterman M S .Identification of commo
n molecular subsequences.[J].Journal of Molecular Biol
ogy, 1981, 147(1):195-197.DOI:10.1016/0022-2836(81)9
0087-5.
[12] Lattner C , Adve V .LLVM: A Compilation Framewor
k for Lifelong Program Analysis & Transformation[J].I
EEE, 2004.DOI:10.1109/CGO.2004.1281665.
[13] LLVM. The LLVM Compiler Infrastructure[EB/OL]. [2
024-10-20]. https://llvm.org/.
[14] NVCC. NVIDIACUDA Toolkit Documentation[EB/OL].
[2024-10-20]. https://docs.nvidia.com/cuda/archive/11.2.
1/cuda-compiler-driver-nvcc/
[15] Roberto Castañeda Lozano, Carlsson M , Drejhammar
F ,et al.Constraint-Based Register Allocation and Inst
ruction Scheduling[C]//International Conference on Prin
ciples & Practice of Constraint Programming.2012.DOI:
10.1007/978-3-642-33558-7_54.
[16] 杨太龙,赵红朋,张磊.基于国产异构平台的奇异值分解法
[J].计算机工程, 2024(9).
YANG T L, ZHAO H P, ZHANG L. Singular Value
Decomposition Based on Domestic Heterogeneous Plat
forms [J]. Computer Engineering, 2024(9).
[17] Liu J , Wu Z , Yu D ,et al.HeterPS: Distributed Dee
p Learning With Reinforcement Learning Based Sched
uling in Heterogeneous Environments[J]. 2021.DOI:10.
48550/arXiv.2111.10635.
[18] 张军,魏继桢,沈凡凡,等. 基于 GPGPU-sim 的多 k
ernel 场景下 GPGPU 性能优化实验方法[J]. 实验技术
与管理, 2024, 41(7):87-93.
ZHANG J, WEI J Z, SHEN F F, et al. Experimental
method for optimizing GPGPU performance in a mul
tiple-kernel environment based on GPGPU-sim[J]. Exp
erimental Technology and Management, 2024, 41(7): 8
7-93. (in Chinese)
[19] AMD. AMD ROCm™ Documentation[EB/OL]. [2024-
10-20]. https://rocm.docs.amd.com/en/latest/.
[20] Cytron R , Ferrante J , Rosen B K ,et al.Efficiently c
omputing static single assignment form and the contro
l dependence graph[J].Acm Trans.prog.lang.syst, 1991,
13(4):451-490.DOI:10.1145/115372.115320.
[21] PassManager. llvm::PassManager< IRUnitT, AnalysisMa
nagerT, ExtraArgTs > Class Template Reference[EB/O
L]. [2024-10-21]. https://llvm.org/doxygen/classllvm_1_
1PassManager.html
[22] Huang J C , Leng T .Generalized loop-unrolling: a m
ethod for program speedup[J].IEEE, 1999.DOI:10.1109/
ASSET.1999.756775.
[23] Rodriguezcancio M , Combemale B , Baudry B .Auto
matic Microbenchmark Generation to Prevent Dead Co
de Elimination and Constant Folding[J].ACM, 2016.D
OI:10.1145/2970276.2970346.
[24] Jin Z , Vetter J S .A Benchmark Suite for Improving
Performance Portability of the SYCL Programming M
odel[C]//2023 IEEE International Symposium on Perfor
mance Analysis of Systems and Software (ISPASS).0
[2024-10-21].DOI:10.1109/ISPASS57527.2023.00041.
[25] Tensile. AMD ROCm™ Software [EB/OL]. [2024-10-2
1]. https://github.com/ROCm/Tensile.
|