[1] 中关村云计算产业联盟, 汉能投资集团. 2022 年中国云计算生态蓝皮书[R]. 2022: 44-50.
Zhongguancun Cloud Computing Industry Alliance, Hina Group. 2022 China Cloud Computing Ecosystem Blue Book[R]. 2022: 44-50.
[2] NVIDIA Corporation. How to Optimize Data Transfers
in CUDA C/C++ [EB/OL]. [2023-10-16].
https://developer.nvidia.com/blog/how-optimize-data-tra
nsfers-cuda-cc/.
[3] NVIDIA Corporation. NVIDIA Tesla P100[EB/OL].
[2023-10-20].
https://images.nvidia.com/content/pdf/tesla/whitepaper/p
ascal-architecture-whitepaper.pdf.
[4] NVIDIA Corporation. NVIDIA TESLA V100 GPU
ARCHITECTURE [EB/OL]. [2023-10-22].
https://images.nvidia.com/content/volta-architecture/pdf/
volta-architecture-whitepaper.pdf.
[5] Arm Developer. Arm Mali GPU OpenCL Developer
Guide[EB/OL]. [2023-10-21].
https://developer.arm.com/documentation/101574/0502/
OpenCL-2-0/Shared-virtual-memory.
[6] Ben Ashbaugh. cl_intel_unified_shared_memory[EB/OL]. [2023-10-29].
https://registry.khronos.org/OpenCL/extensions/intel/cl_
intel_unified_shared_memory.html.
[7] Jon Peddie Research. Q1’22 saw a decline in GPU and
PC shipments quarter-to-quarter[EB/OL]. [2023-10-23].
https://www.jonpeddie.com/news/q122-saw-a-decline-in-
gpu-and-pc-shipments-quarter-to-quarter/.
[8] METAX-TECH. Metax-tech[EB/OL]. [2023-10-21].
https://www.metax-tech.com/.
[9] MTHREADS. Mthreads[EB/OL]. [2023-10-25].
https://www.mthreads.com/.
[10] BIRENTECH. Birentech[EB/OL]. [2023-10-19].
https://www.birentech.com/.
[11] BIREN TECHNOLOGY. BR100[EB/OL]. [2023-10-20].
https://www.birentech.com/News_details/16125806.html.
[12] ILUVATAR. Iluvatar[EB/OL]. [2023-10-20].
https://www.iluvatar.com/.
[13] CAMBRICON. Cambricon[EB/OL]. [2023-10-20].
https://www.cambricon.com/.
[14] HYGON. Hygon[EB/OL]. [2023-10-20].
https://www.hygon.cn/product/accelerator.
[15] HYGON. Hygon[EB/OL]. [2023-10-20].
https://www.hygon.cn/product/accelerator.
[16] SIETIUM. Sietium[EB/OL]. [2023-10-20].
https://www.sietium.com/.
[17] ENFLAME-TECH. Enflame-tech[EB/OL]. [2023-10-20].
https://www.enflame-tech.com/.
[18] DENGLINAI. Denglinai[EB/OL]. [2023-10-20].
https://denglinai.com/.
[19] INNOSILICON. Innosilicon[EB/OL]. [2023-10-20].
https://www.innosilicon.cn/.
[20] ZHAOXIN. Zhaoxin[EB/OL]. [2023-10-22].
https://www.zhaoxin.com/.
[21] CSIC-711. Csic-711[EB/OL]. [2023-10-20].
http://www.csic-711.com/ch/main.asp.
[22] ICUBECORP. Icubecorp[EB/OL]. [2023-10-18].
http://www.icubecorp.cn/.
[23] NVIDIA Corporation. CUDA Toolkit: Develop, Optimize
and Deploy GPU-Accelerated Apps[EB/OL].
[2023-10-20]. https://developer.nvidia.com/cuda-toolkit.
[24] Khronos Group. OpenCL: Open Standard for Parallel
Programming of Heterogeneous Systems[EB/OL].
[2023-10-20]. https://www.khronos.org/opencl/.
[25] AMD. AMD ROCm™ Documentation[EB/OL].
[2023-10-20]. https://rocm.docs.amd.com/en/latest/.
[26] NVIDIA Corporation. An Easy Introduction to CUDA C
and C++ [EB/OL]. [2023-10-21].
https://developer.nvidia.com/blog/easy-introduction-cud
a-c-and-c/.
[27] LLVM. The LLVM Compiler Infrastructure[EB/OL].
[2023-10-17]. https://llvm.org/.
[28] NVIDIA Corporation. cuBLAS[EB/OL]. [2023-10-26].
https://docs.nvidia.com/cuda/cublas/.
[29] NVIDIA Corporation. cuFFT API Reference[EB/OL].
[2023-10-22].
https://docs.nvidia.com/cuda/cufft/index.html.
[30] NVIDIA Corporation. cuRAND[EB/OL]. [2023-10-15].
https://docs.nvidia.com/cuda/curand/index.html.
[31] NVIDIA Corporation. NVIDIA 2D Image And Signal
Performance Primitives (NPP) [EB/OL]. [2023-10-20].
https://docs.nvidia.com/cuda/npp/index.html.
[32] NVIDIA Corporation. CUDA C++ Programming
Guide[EB/OL]. [2023-10-19].
https://docs.nvidia.com/cuda/cuda-c-programming-guide
/index.html.
[33] LI Z, PENG B, WENG C. Xeflow: Streamlining
inter-processor pipeline execution for the discrete
cpu-gpu platform[J]. IEEE Transactions on Computers,
2020, 69(6): 819-831.
[34] KWON Y, RHU M. A case for memory-centric hpc
system architecture for training deep neural networks[J].
IEEE computer architecture letters, 2018, 17(2):
134-138.
[35] MENG C, SUN M, YANG J, et al. Training deeper
models by gpu memory optimization on
tensorflow[C]//Proc. of ML Systems Workshop in NIPS:
volume 7. 2017.
[36] RHU M, GIMELSHEIN N, CLEMONS J, et al. vdnn:
Virtualized deep neural networks for scalable,
memory-efficient neural network design [C]//2016 49th
Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, 2016: 1-13.
[37] RHU M, O’CONNOR M, CHATTERJEE N, et al.
Compressing dma engine: Leveraging activation sparsity
for training deep neural networks [C]//2018 IEEE
International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2018: 78-91.
[38] ZHENG T, NELLANS D, ZULFIQAR A, et al. Towards
high performance paged memory for gpus[C]//2016
IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2016: 345-357.
[39] NVIDIA Corporation. NVIDIA Pascal
Architecture[EB/OL]. [2023-10-20].
https://www.nvidia.com/en-us/data-center/pascal-gpu-ar
chitecture/.
[40] AGARWAL N, NELLANS D, STEPHENSON M, et al.
Page placement strategies for gpus within heterogeneous
memory systems[C]// Proceedings of the Twentieth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2015:
607-618.
[41] AMD. AMD GRAPHICS CORES NEXT (GCN)
ARCHITECTURE [EB/OL]. [2023-10-20].
https://www.techpowerup.com/gpu-specs/docs/amd-gcn1
-architecture.pdf.
[42] Mark Harris. Unified Memory in CUDA 6[EB/OL].
[2023-10-23].
https://developer.nvidia.com/blog/unified-memory-in-cu
da-6/.
[43] NVIDIA Corporation. CUDA C++ Best Practices
Guide: Zero Copy[EB/OL]. [2023-10-13].
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide
/index.html#zero-copy.
[44] NVIDIA Corporation. Peer-to-Peer Unified Virtual
Addressing[EB/OL]. [2023-10-17].
https://developer.download.nvidia.com/CUDA/training/c
uda_webinars_GPUDirect_uva.pdf.
[45] NVIDIA Corporation. CUDA C++ Best Practices
Guide: Unified Virtual Addressing[EB/OL]. [2023-10-17].
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide
/index.html#unified-virtual-addressing.
[46] NVIDIA Corporation. CUDA Toolkit 4.0[EB/OL].
[2023-10-21].
https://developer.nvidia.com/cuda-toolkit-40.
[47] PCI-SIG. PCI Express® Base Specification Revision
3.0[S]. 2010.
[48] NVIDIA Corporation. CUDA C++ Programming
Guide: Performance Tuning[EB/OL].
[2023-10-18].
https://docs.nvidia.com/cuda/cuda-c-programming-guide
/index.html#performance-tuning.
[49] NVIDIA Corporation. Profiler User’s Guide[EB/OL].
[2023-10-17].
https://docs.nvidia.com/cuda/profiler-users-guide/index.
html.
[50] NVIDIA Corporation. NVIDIA Nsight Systems[EB/OL].
[2023-10-19].
https://developer.nvidia.com/nsight-systems.
[51] NVIDIA Corporation. NVIDIA Nsight Compute[EB/OL].
[2023-10-22].
https://developer.nvidia.com/nsight-compute.
[52] NVIDIA Corporation. CUDA Runtime API[EB/OL].
[2023-10-25].
https://docs.nvidia.com/cuda/cuda-runtime-api/.
[53] JUNG J, KIM J, LEE J. Deepum: Tensor migration and
prefetching in unified memory[C]//Proceedings of the
28th ACM International Conference on Architectural
Support for Programming Languages and Operating
Systems, Volume 2. 2023: 207-221.
[54] GELADO I, STONE J E, CABEZAS J, et al. An
asymmetric distributed shared memory model for
heterogeneous parallel systems[C]//Proceedings of the
Fifteenth International Conference on Architectural
Support for Programming Languages and Operating
Systems. 2010: 347-358.
[55] JABLIN T B, PRABHU P, JABLIN J A, et al. Automatic
cpu-gpu communication management and
optimization[C]//Proceedings of the 32nd ACM
SIGPLAN conference on Programming language design
and implementation. 2011: 142-151.
[56] JABLIN T B, JABLIN J A, PRABHU P, et al.
Dynamically managed data for cpu-gpu
architectures[C]//Proceedings of the Tenth International
Symposium on Code Generation and Optimization. 2012:
165-174.
[57] PAI S, GOVINDARAJAN R, THAZHUTHAVEETIL M J.
Fast and efficient automatic memory management for
gpus using compiler-assisted runtime coherence
scheme[C]//Proceedings of the 21st international
conference on Parallel architectures and compilation
techniques. 2012: 33-42.
[58] ALSABER N, KULKARNI M. Semcache:
Semantics-aware caching for efficient gpu
offloading[C]//Proceedings of the 27th international
ACM conference on International conference on
supercomputing. 2013: 421-432.
[59] WANG L, YE J, ZHAO Y, et al. Superneurons: Dynamic
gpu memory management for training deep neural
networks[C]//Proceedings of the 23rd ACM SIGPLAN
symposium on principles and practice of parallel
programming. 2018: 41-53.
[60] 裴威, 李战怀, 潘巍. GPU 数据库核心技术综述[J]. 软
件学报, 2021, 32(3): 859-885.
PEI W, LI Z H, PAN W. Survey of key technologies in GPU
database system[J]. Ruan Jian Xue Bao/Journal of Software,
2021, 32(3): 859-885.
[61] 李志方. 异构体系结构上的数据处理加速[D]. 上海:
华东师范大学, 2021.
LI Zhifang. Accelerating Data Processing on the
Heterogeneous Architecture[D]. Shanghai, China: East China
Normal University, 2021.
[62] DEAN J, GHEMAWAT S. Mapreduce: simplified data
processing on large clusters[J]. Communications of the
ACM, 2008, 51(1): 107-113.
[63] HE B, FANG W, LUO Q, et al. Mars: a mapreduce
framework on graphics processors[C]//Proceedings of
the 17th international conference on Parallel
architectures and compilation techniques. 2008: 260-269.
[64] Nikolay Sakharnykh. UNIFIED MEMORY ON PASCAL
AND VOLTA [EB/OL]. [2023-10-20].
https://on-demand.gputechconf.com/gtc/2017/presentatio
n/s7285-nikolay-sakharnykh-unified-memory-on-pascal-
and-volta.pdf.
[65] Nikolay Sakharnykh. Beyond GPU Memory Limits with
Unified Memory on Pascal[EB/OL]. [2023-10-20].
https://developer.nvidia.com/blog/beyond-gpu-memory-l
imits-unified-memory-pascal/.
[66] NVIDIA Corporation. Maximizing Unified Memory
Performance in CUDA[EB/OL]. [2023-10-20].
https://developer.nvidia.com/blog/maximizing-unified-m
emory-performance-cuda/.
[67] Nikolay Sakharnykh. EVERYTHING YOU NEED TO
KNOW ABOUT UNIFIED MEMORY[EB/OL].
[2023-10-20].
https://on-demand.gputechconf.com/gtc/2018/presentatio
n/s8430-everything-you-need-to-know-about-unified-me
mory.pdf.
[68] JOG A, KAYIRAN O, CHIDAMBARAM
NACHIAPPAN N, et al. Owl: cooperative thread array
aware scheduling techniques for improving gpgpu
performance[J]. ACM SIGPLAN Notices, 2013, 48(4):
395-406.
[69] JOHNSON T L, MERTEN M C, HWU W M W. Run-time
spatial locality detection and
optimization[C]//Proceedings of 30th Annual
International Symposium on Microarchitecture. IEEE,
1997: 57-64.
[70] JOG A, KAYIRAN O, MISHRA A K, et al. Orchestrated
scheduling and prefetching for gpgpus[C]//Proceedings
of the 40th Annual International Symposium on
Computer Architecture. 2013: 332-343.
[71] GANGULY D, ZHANG Z, YANG J, et al. Interplay
between hardware prefetcher and page eviction policy in
cpu-gpu unified virtual memory[C]//Proceedings of the
46th International Symposium on Computer Architecture.
2019: 224-235.
[72] GANGULY D, ZHANG Z, YANG J, et al. Adaptive page
migration for irregular data-intensive applications under
gpu memory oversubscription[C]//2020 IEEE
International Parallel and Distributed Processing
Symposium (IPDPS). IEEE, 2020: 451-461.
[73] GANGULY D, MELHEM R, YANG J. An adaptive
framework for oversubscription management in cpu-gpu
unified memory[C]//2021 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE, 2021:
1212-1217.
[74] HILDRUM K, YU P S. Focused community
discovery[C]//Fifth IEEE International Conference on
Data Mining (ICDM’05). IEEE, 2005: 4 pp.
[75] REN B, AGRAWAL G, LARUS J R, et al. Simd
parallelization of applications that traverse irregular data
structures[C]//Proceedings of the 2013 IEEE/ACM
International Symposium on Code Generation and
Optimization (CGO). IEEE, 2013: 1-10.
[76] Thomson Comer. Accelerating Geographic Information
Systems (GIS) Data Science with RAPIDS cuSpatial and
GPUs[EB/OL]. [2023-10-20].
https://medium.com/rapids-ai/acclerating-gis-data-scienc
e-with-rapids-cuspatial-and-gpus-fd012b27af0a.
[77] AMD. AMD APP SDK OpenCL Optimization
Guide[EB/OL]. [2023-10-20].
https://www.amd.com/system/files/TechDocs/AMD_Ope
nCL_Programming_Optimization_Guide2.pdf.
[78] Arm Developer. Arm Mali GPU OpenCL Developer
Guide[EB/OL]. [2023-10-20].
https://documentation-service.arm.com/static/633fe2dbd
a191e7fe057f2ac.
[79] LI C, AUSAVARUNGNIRUN R, ROSSBACH C J, et al.
A framework for memory oversubscription management
in graphics processing units[C]//Proceedings of the
Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating
Systems. 2019: 49-63.
[80] JIANG S, CHEN F, ZHANG X. Clock-pro: An effective
improvement of the clock replacement.[C]//USENIX
Annual Technical Conference, General Track. 2005:
323-336.
[81] JALEEL A, THEOBALD K B, STEELY JR S C, et al.
High performance cache replacement using re-reference
interval prediction (rrip)[J]. ACM SIGARCH computer
architecture news, 2010, 38(3): 60-71.
[82] YU Q, CHILDERS B, HUANG L, et al.
HPE: Hierarchical page eviction policy for unified
memory in gpus[J]. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and
Systems, 2019, 39(10): 2461-2474.
[83] CHE S, BOYER M, MENG J, et al. Rodinia: A
benchmark suite for heterogeneous computing[C]//2009
IEEE international symposium on workload
characterization (IISWC). IEEE, 2009: 44-54.
[84] STRATTON J A, RODRIGUES C, SUNG I J, et al.
Parboil: A revised benchmark suite for scientific and
commercial throughput computing[J]. Center for
Reliable and High-Performance Computing, 2012, 127:
27.
[85] GRAUER-GRAY S, XU L, SEARLES R, et al.
Auto-tuning a high-level language targeted to gpu
codes[C]//2012 innovative parallel computing (InPar).
IEEE, 2012: 1-10.
[86] YU Q, CHILDERS B, HUANG L, et al. Coordinated
page prefetch and eviction for memory oversubscription
management in gpus[C]//2020 IEEE International
Parallel and Distributed Processing Symposium (IPDPS).
IEEE, 2020: 472-482.
[87] KIM H, SIM J, GERA P, et al. Batch-aware unified
memory management in gpus for irregular
workloads[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
1357-1370.
[88] PARK D, KIM H, HAN H. Page reuse in cyclic thrashing
of gpu under oversubscription:
Work-in-progress[C]//2020 International Conference on
Compilers, Architecture, and Synthesis for Embedded
Systems (CASES). IEEE, 2020: 15-16.
[89] LI L, CHAPMAN B. Compiler assisted hybrid implicit
and explicit gpu memory management under unified
address space[C]//Proceedings of the International
Conference for High Performance Computing,
Networking, Storage and Analysis. 2019: 1-16.
[90] CHANG C H, KUMAR A, SIVASUBRAMANIAM A. To
move or not to move? page migration for irregular
applications in over-subscribed gpu memory systems
with dynamap[C]//Proceedings of the 14th ACM
International Conference on Systems and Storage. 2021:
1-12.
[91] MARKTHUB P, BELVIRANLI M E, LEE S, et al.
Dragon: breaking gpu memory capacity limits with direct
nvm access[C]//SC18: International Conference for High
Performance Computing, Networking, Storage and
Analysis. IEEE, 2018: 414-426.
[92] WU K, REN J, LI D. Runtime data management on
non-volatile memory-based heterogeneous memory for
task-parallel programs[C]//SC18: International
Conference for High Performance Computing,
Networking, Storage and Analysis. IEEE, 2018: 401-413.
[93] 王嘉伦. 基于 GPU 的大规模数据分析查询的统一内存
管理和系统性能优化[D]. 上海: 华东师范大学, 2023.
WANG Jialun. Unified Memory Management and System
Performance Optimization for GPU-Based Large-Scale
Analytical Query Processing[D]. Shanghai, China: East China
Normal University, 2023.
[94] WANG J, PANG W, WENG C, et al. D-cubicle: boosting
data transfer dynamically for large-scale analytical
queries in single-gpu systems[J]. Frontiers of Computer
Science, 2023, 17(4): 174610.
[95] BAE J, LEE J, JIN Y, et al. FlashNeuron: SSD-Enabled
Large-Batch Training of Very Deep Neural
Networks[C]//19th USENIX Conference on File and
Storage Technologies (FAST 21). 2021: 387-401.
[96] CHOUKSE E, SULLIVAN M B, O’CONNOR M, et al.
Buddy compression: Enabling larger memory for deep
learning and hpc workloads on gpus[C]//2020
ACM/IEEE 47th Annual International Symposium on
Computer Architecture (ISCA). IEEE, 2020: 926-939.
[97] HAN S, POOL J, TRAN J, et al. Learning both weights
and connections for efficient neural network[J].
Advances in neural information processing systems, 2015, 28.
[98] JAIN A, PHANISHAYEE A, MARS J, et al. Gist:
Efficient data encoding for deep neural network
training[C]//2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture (ISCA). IEEE,
2018: 776-789.
[99] CHEN T, XU B, ZHANG C, et al. Training deep nets
with sublinear memory cost[J]. arXiv preprint
arXiv:1604.06174, 2016.
[100] GRUSLYS A, MUNOS R, DANIHELKA I, et al.
Memory-efficient backpropagation through time[J].
Advances in neural information processing systems,
2016, 29.
[101] PENG X, SHI X, DAI H, et al. Capuchin: Tensor-based
gpu memory management for deep
learning[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
891-905.
[102] AWAN A A, CHU C H, SUBRAMONI H, et al. Oc-dnn:
Exploiting advanced unified memory capabilities in cuda
9 and volta gpus for out-of-core dnn training[C]//2018
IEEE 25th International Conference on High
Performance Computing (HiPC). IEEE, 2018: 143-152.
[103] HILDEBRAND M, KHAN J, TRIKA S, et al. Autotm:
Automatic tensor movement in heterogeneous memory
systems using integer linear
programming[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
875-890.
[104] HUANG C C, JIN G, LI J. Swapadvisor: Pushing deep
learning beyond the gpu memory limit via smart
swapping[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
1341-1355.
[105] LE T D, IMAI H, NEGISHI Y, et al. Tflms: Large model
support in tensorflow by graph rewriting[J]. arXiv
preprint arXiv:1807.02037, 2018.
[106] RASLEY J, RAJBHANDARI S, RUWASE O, et al.
Deepspeed: System optimizations enable training deep
learning models with over 100 billion
parameters[C]//Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery &
Data Mining. 2020: 3505-3506.
[107] REN J, LUO J, WU K, et al. Sentinel: Efficient tensor
migration and allocation on heterogeneous memory
systems for deep learning[C]//2021 IEEE International
Symposium on High-Performance Computer
Architecture (HPCA). IEEE, 2021: 598-611.
[108] CHIEN S, PENG I, MARKIDIS S. Performance
evaluation of advanced features in cuda unified
memory[C]//2019 IEEE/ACM Workshop on Memory
Centric High Performance Computing (MCHPC). IEEE,
2019: 50-57.
[109] 王鹤澎, 王宏志, 李佳宁, 等. 面向新型处理器的数据密集
型计算[J]. 软件学报, 2016, 27(8): 2048-2067.
WANG H P, WANG H Z, LI J N, et al. New processor for
data-intensive computing[J]. Ruan Jian Xue Bao/Journal of
Software, 2016, 27(8): 2048-2067.