[1] 中关村云计算产业联盟, 汉能投资集团. 2022 年中国云计算生态蓝皮书[R]. 2022: 44-50.
Zhongguancun Cloud Computing Industry Alliance, Hina Group. 2022 China Cloud Computing Ecosystem Blue Book[R]. 2022: 44-50.
[2] NVIDIA Corporation. How to Optimize Data Transfers
in CUDA C/C++ [EB/OL]. [2023-10-16].
https://developer.nvidia.com/blog/how-optimize-data-tra
nsfers-cuda-cc/.
[3] NVIDIA Corporation. NVIDIA Tesla P100[EB/OL].
[2023-10-20].
https://images.nvidia.com/content/pdf/tesla/whitepaper/p
ascal-architecture-whitepaper.pdf.
[4] NVIDIA Corporation. NVIDIA TESLA V100 GPU
ARCHITECTURE [EB/OL]. [2023-10-22].
https://images.nvidia.com/content/volta-architecture/pdf/
volta-architecture-whitepaper.pdf.
[5] Arm Developer. Arm Mali GPU OpenCL Developer
Guide[EB/OL]. [2023-10-21].
https://developer.arm.com/documentation/101574/0502/
OpenCL-2-0/Shared-virtual-memory.
[6] Ben Ashbaugh. cl_intel_unified_shared_memory[EB/OL]. [2023-10-29].
https://registry.khronos.org/OpenCL/extensions/intel/cl_
intel_unified_shared_memory.html.
[7] Jon Peddie Research. Q1’22 saw a decline in GPU and
PC shipments quarter-to-quarter[EB/OL]. [2023-10-23].
https://www.jonpeddie.com/news/q122-saw-a-decline-in-
gpu-and-pc-shipments-quarter-to-quarter/.
[8] METAX-TECH. Metax-tech[EB/OL]. [2023-10-21].
https://www.metax-tech.com/.
[9] MTHREADS. Mthreads[EB/OL]. [2023-10-25].
https://www.mthreads.com/.
[10] BIRENTECH. Birentech[EB/OL]. [2023-10-19].
https://www.birentech.com/.
[11] BIREN TECHNOLOGY. BR100[EB/OL]. [2023-10-20].
https://www.birentech.com/News_details/16125806.html.
[12] ILUVATAR. Iluvatar[EB/OL]. [2023-10-20].
https://www.iluvatar.com/.
[13] CAMBRICON. Cambricon[EB/OL]. [2023-10-20].
https://www.cambricon.com/.
[14] HYGON. Hygon[EB/OL]. [2023-10-20].
https://www.hygon.cn/product/accelerator.
[15] HYGON. Hygon[EB/OL]. [2023-10-20].
https://www.hygon.cn/product/accelerator.
[16] SIETIUM. Sietium[EB/OL]. [2023-10-20].
https://www.sietium.com/.
[17] ENFLAME-TECH. Enflame-tech[EB/OL]. [2023-10-20].
https://www.enflame-tech.com/.
[18] DENGLINAI. Denglinai[EB/OL]. [2023-10-20].
https://denglinai.com/.
[19] INNOSILICON. Innosilicon[EB/OL]. [2023-10-20].
https://www.innosilicon.cn/.
[20] ZHAOXIN. Zhaoxin[EB/OL]. [2023-10-22].
https://www.zhaoxin.com/.
[21] CSIC-711. Csic-711[EB/OL]. [2023-10-20].
http://www.csic-711.com/ch/main.asp.
[22] ICUBECORP. Icubecorp[EB/OL]. [2023-10-18].
http://www.icubecorp.cn/.
[23] NVIDIA Corporation. CUDA Toolkit: Develop, Optimize
and Deploy GPU-Accelerated Apps[EB/OL].
[2023-10-20]. https://developer.nvidia.com/cuda-toolkit.
[24] Khronos Group. OpenCL: Open Standard for Parallel
Programming of Heterogeneous Systems[EB/OL].
[2023-10-20]. https://www.khronos.org/opencl/.
[25] AMD. AMD ROCm™ Documentation[EB/OL].
[2023-10-20]. https://rocm.docs.amd.com/en/latest/.
[26] NVIDIA Corporation. An Easy Introduction to CUDA C
and C++ [EB/OL]. [2023-10-21].
https://developer.nvidia.com/blog/easy-introduction-cud
a-c-and-c/.
[27] LLVM. The LLVM Compiler Infrastructure[EB/OL].
[2023-10-17]. https://llvm.org/.
[28] NVIDIA Corporation. cuBLAS[EB/OL]. [2023-10-26].
https://docs.nvidia.com/cuda/cublas/.
[29] NVIDIA Corporation. cuFFT API Reference[EB/OL].
[2023-10-22].
https://docs.nvidia.com/cuda/cufft/index.html.
[30] NVIDIA Corporation. cuRAND[EB/OL]. [2023-10-15].
https://docs.nvidia.com/cuda/curand/index.html.
[31] NVIDIA Corporation. NVIDIA 2D Image And Signal
Performance Primitives (NPP) [EB/OL]. [2023-10-20].
https://docs.nvidia.com/cuda/npp/index.html.
[32] NVIDIA Corporation. CUDA C++ Programming
Guide[EB/OL]. [2023-10-19].
https://docs.nvidia.com/cuda/cuda-c-programming-guide
/index.html.
[33] LI Z, PENG B, WENG C. Xeflow: Streamlining
inter-processor pipeline execution for the discrete
cpu-gpu platform[J]. IEEE Transactions on Computers,
2020, 69(6): 819-831.
[34] KWON Y, RHU M. A case for memory-centric hpc
system architecture for training deep neural networks[J].
IEEE computer architecture letters, 2018, 17(2):
134-138.
[35] MENG C, SUN M, YANG J, et al. Training deeper
models by gpu memory optimization on
tensorflow[C]//Proc. of ML Systems Workshop in NIPS:
volume 7. 2017.
[36] RHU M, GIMELSHEIN N, CLEMONS J, et al. vdnn:
Virtualized deep neural networks for scalable,
memory-efficient neural network design [C]//2016 49th
Annual IEEE/ACM International Symposium on
Microarchitecture (MICRO). IEEE, 2016: 1-13.
[37] RHU M, O’CONNOR M, CHATTERJEE N, et al.
Compressing dma engine: Leveraging activation sparsity
for training deep neural networks [C]//2018 IEEE
International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2018: 78-91.
[38] ZHENG T, NELLANS D, ZULFIQAR A, et al. Towards
high performance paged memory for gpus[C]//2016
IEEE International Symposium on High Performance
Computer Architecture (HPCA). IEEE, 2016: 345-357.
[39] NVIDIA Corporation. NVIDIA Pascal
Architecture[EB/OL]. [2023-10-20].
https://www.nvidia.com/en-us/data-center/pascal-gpu-ar
chitecture/.
[40] AGARWAL N, NELLANS D, STEPHENSON M, et al.
Page placement strategies for gpus within heterogeneous
memory systems[C]// Proceedings of the Twentieth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2015:
607-618.
[41] AMD. AMD GRAPHICS CORES NEXT (GCN)
ARCHITECTURE [EB/OL]. [2023-10-20].
https://www.techpowerup.com/gpu-specs/docs/amd-gcn1
-architecture.pdf.
[42] Mark Harris. Unified Memory in CUDA 6[EB/OL].
[2023-10-23].
https://developer.nvidia.com/blog/unified-memory-in-cu
da-6/.
[43] NVIDIA Corporation. CUDA C++ Best Practices
Guide: Zero Copy[EB/OL]. [2023-10-13].
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide
/index.html#zero-copy.
[44] NVIDIA Corporation. Peer-to-Peer Unified Virtual
Addressing[EB/OL]. [2023-10-17].
https://developer.download.nvidia.com/CUDA/training/c
uda_webinars_GPUDirect_uva.pdf.
[45] NVIDIA Corporation. CUDA C++ Best Practices
Guide: Unified Virtual Addressing[EB/OL]. [2023-10-17].
https://docs.nvidia.com/cuda/cuda-c-best-practices-guide
/index.html#unified-virtual-addressing.
[46] NVIDIA Corporation. CUDA Toolkit 4.0[EB/OL].
[2023-10-21].
https://developer.nvidia.com/cuda-toolkit-40.
[47] PCI-SIG. PCI Express® Base Specification Revision
3.0[S]. 2010.
[48] NVIDIA Corporation. CUDA C++ Programming
Guide: Performance Tuning[EB/OL].
[2023-10-18].
https://docs.nvidia.com/cuda/cuda-c-programming-guide
/index.html#performance-tuning.
[49] NVIDIA Corporation. Profiler User’s Guide[EB/OL].
[2023-10-17].
https://docs.nvidia.com/cuda/profiler-users-guide/index.
html.
[50] NVIDIA Corporation. NVIDIA Nsight Systems[EB/OL].
[2023-10-19].
https://developer.nvidia.com/nsight-systems.
[51] NVIDIA Corporation. NVIDIA Nsight Compute[EB/OL].
[2023-10-22].
https://developer.nvidia.com/nsight-compute.
[52] NVIDIA Corporation. CUDA Runtime API[EB/OL].
[2023-10-25].
https://docs.nvidia.com/cuda/cuda-runtime-api/.
[53] JUNG J, KIM J, LEE J. Deepum: Tensor migration and
prefetching in unified memory[C]//Proceedings of the
28th ACM International Conference on Architectural
Support for Programming Languages and Operating
Systems, Volume 2. 2023: 207-221.
[54] GELADO I, STONE J E, CABEZAS J, et al. An
asymmetric distributed shared memory model for
heterogeneous parallel systems[C]//Proceedings of the
Fifteenth International Conference on Architectural
Support for Programming Languages and Operating
Systems. 2010: 347-358.
[55] JABLIN T B, PRABHU P, JABLIN J A, et al. Automatic
cpu-gpu communication management and
optimization[C]//Proceedings of the 32nd ACM
SIGPLAN conference on Programming language design
and implementation. 2011: 142-151.
[56] JABLIN T B, JABLIN J A, PRABHU P, et al.
Dynamically managed data for cpu-gpu
architectures[C]//Proceedings of the Tenth International
Symposium on Code Generation and Optimization. 2012:
165-174.
[57] PAI S, GOVINDARAJAN R, THAZHUTHAVEETIL M J.
Fast and efficient automatic memory management for
gpus using compiler-assisted runtime coherence
scheme[C]//Proceedings of the 21st international
conference on Parallel architectures and compilation
techniques. 2012: 33-42.
[58] ALSABER N, KULKARNI M. Semcache:
Semantics-aware caching for efficient gpu
offloading[C]//Proceedings of the 27th international
ACM conference on International conference on
supercomputing. 2013: 421-432.
[59] WANG L, YE J, ZHAO Y, et al. Superneurons: Dynamic
gpu memory management for training deep neural
networks[C]//Proceedings of the 23rd ACM SIGPLAN
symposium on principles and practice of parallel
programming. 2018: 41-53.
[60] 裴威, 李战怀, 潘巍. GPU 数据库核心技术综述[J]. 软
件学报, 2021, 32(3): 859-885.
PEI W, LI Z H, PAN W. Survey of key technologies in GPU
database system[J]. Ruan Jian Xue Bao/Journal of Software,
2021, 32(3): 859-885.
[61] 李志方. 异构体系结构上的数据处理加速[D]. 上海:
华东师范大学, 2021.
LI Zhifang. Accelerating Data Processing on the
Heterogeneous Architecture[D]. Shanghai, China: East China
Normal University, 2021.
[62] DEAN J, GHEMAWAT S. Mapreduce: simplified data
processing on large clusters[J]. Communications of the
ACM, 2008, 51(1): 107-113.
[63] HE B, FANG W, LUO Q, et al. Mars: a mapreduce
framework on graphics processors[C]//Proceedings of
the 17th international conference on Parallel
architectures and compilation techniques. 2008: 260-269.
[64] Nikolay Sakharnykh. UNIFIED MEMORY ON PASCAL
AND VOLTA [EB/OL]. [2023-10-20].
https://on-demand.gputechconf.com/gtc/2017/presentatio
n/s7285-nikolay-sakharnykh-unified-memory-on-pascal-
and-volta.pdf.
[65] Nikolay Sakharnykh. Beyond GPU Memory Limits with
Unified Memory on Pascal[EB/OL]. [2023-10-20].
https://developer.nvidia.com/blog/beyond-gpu-memory-l
imits-unified-memory-pascal/.
[66] NVIDIA Corporation. Maximizing Unified Memory
Performance in CUDA[EB/OL]. [2023-10-20].
https://developer.nvidia.com/blog/maximizing-unified-m
emory-performance-cuda/.
[67] Nikolay Sakharnykh. EVERYTHING YOU NEED TO
KNOW ABOUT UNIFIED MEMORY[EB/OL].
[2023-10-20].
https://on-demand.gputechconf.com/gtc/2018/presentatio
n/s8430-everything-you-need-to-know-about-unified-me
mory.pdf.
[68] JOG A, KAYIRAN O, CHIDAMBARAM
NACHIAPPAN N, et al. Owl: cooperative thread array
aware scheduling techniques for improving gpgpu
performance[J]. ACM SIGPLAN Notices, 2013, 48(4):
395-406.
[69] JOHNSON T L, MERTEN M C, HWU W M W. Run-time
spatial locality detection and
optimization[C]//Proceedings of 30th Annual
International Symposium on Microarchitecture. IEEE,
1997: 57-64.
[70] JOG A, KAYIRAN O, MISHRA A K, et al. Orchestrated
scheduling and prefetching for gpgpus[C]//Proceedings
of the 40th Annual International Symposium on
Computer Architecture. 2013: 332-343.
[71] GANGULY D, ZHANG Z, YANG J, et al. Interplay
between hardware prefetcher and page eviction policy in
cpu-gpu unified virtual memory[C]//Proceedings of the
46th International Symposium on Computer Architecture.
2019: 224-235.
[72] GANGULY D, ZHANG Z, YANG J, et al. Adaptive page
migration for irregular data-intensive applications under
gpu memory oversubscription[C]//2020 IEEE
International Parallel and Distributed Processing
Symposium (IPDPS). IEEE, 2020: 451-461.
[73] GANGULY D, MELHEM R, YANG J. An adaptive
framework for oversubscription management in cpu-gpu
unified memory[C]//2021 Design, Automation & Test in
Europe Conference & Exhibition (DATE). IEEE, 2021:
1212-1217.
[74] HILDRUM K, YU P S. Focused community
discovery[C]//Fifth IEEE International Conference on
Data Mining (ICDM’05). IEEE, 2005: 4 pp.
[75] REN B, AGRAWAL G, LARUS J R, et al. Simd
parallelization of applications that traverse irregular data
structures[C]//Proceedings of the 2013 IEEE/ACM
International Symposium on Code Generation and
Optimization (CGO). IEEE, 2013: 1-10.
[76] Thomson Comer. Accelerating Geographic Information
Systems (GIS) Data Science with RAPIDS cuSpatial and
GPUs[EB/OL]. [2023-10-20].
https://medium.com/rapids-ai/acclerating-gis-data-scienc
e-with-rapids-cuspatial-and-gpus-fd012b27af0a.
[77] AMD. AMD APP SDK OpenCL Optimization
Guide[EB/OL]. [2023-10-20].
https://www.amd.com/system/files/TechDocs/AMD_Ope
nCL_Programming_Optimization_Guide2.pdf.
[78] Arm Developer. Arm Mali GPU OpenCL Developer
Guide[EB/OL]. [2023-10-20].
https://documentation-service.arm.com/static/633fe2dbd
a191e7fe057f2ac.
[79] LI C, AUSAVARUNGNIRUN R, ROSSBACH C J, et al.
A framework for memory oversubscription management
in graphics processing units[C]//Proceedings of the
Twenty-Fourth International Conference on Architectural
Support for Programming Languages and Operating
Systems. 2019: 49-63.
[80] JIANG S, CHEN F, ZHANG X. Clock-pro: An effective
improvement of the clock replacement.[C]//USENIX
Annual Technical Conference, General Track. 2005:
323-336.
[81] JALEEL A, THEOBALD K B, STEELY JR S C, et al.
High performance cache replacement using re-reference
interval prediction (rrip)[J]. ACM SIGARCH computer
architecture news, 2010, 38(3): 60-71.
[82] YU Q, CHILDERS B, HUANG L, et al.
HPE: Hierarchical page eviction policy for unified
memory in gpus[J]. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and
Systems, 2019, 39(10): 2461-2474.
[83] CHE S, BOYER M, MENG J, et al. Rodinia: A
benchmark suite for heterogeneous computing[C]//2009
IEEE international symposium on workload
characterization (IISWC). IEEE, 2009: 44-54.
[84] STRATTON J A, RODRIGUES C, SUNG I J, et al.
Parboil: A revised benchmark suite for scientific and
commercial throughput computing[J]. Center for
Reliable and High-Performance Computing, 2012, 127:
27.
[85] GRAUER-GRAY S, XU L, SEARLES R, et al.
Auto-tuning a high-level language targeted to gpu
codes[C]//2012 innovative parallel computing (InPar).
IEEE, 2012: 1-10.
[86] YU Q, CHILDERS B, HUANG L, et al. Coordinated
page prefetch and eviction for memory oversubscription
management in gpus[C]//2020 IEEE International
Parallel and Distributed Processing Symposium (IPDPS).
IEEE, 2020: 472-482.
[87] KIM H, SIM J, GERA P, et al. Batch-aware unified
memory management in gpus for irregular
workloads[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
1357-1370.
[88] PARK D, KIM H, HAN H. Page reuse in cyclic thrashing
of gpu under oversubscription:
Work-in-progress[C]//2020 International Conference on
Compilers, Architecture, and Synthesis for Embedded
Systems (CASES). IEEE, 2020: 15-16.
[89] LI L, CHAPMAN B. Compiler assisted hybrid implicit
and explicit gpu memory management under unified
address space[C]//Proceedings of the International
Conference for High Performance Computing,
Networking, Storage and Analysis. 2019: 1-16.
[90] CHANG C H, KUMAR A, SIVASUBRAMANIAM A. To
move or not to move? page migration for irregular
applications in over-subscribed gpu memory systems
with dynamap[C]//Proceedings of the 14th ACM
International Conference on Systems and Storage. 2021:
1-12.
[91] MARKTHUB P, BELVIRANLI M E, LEE S, et al.
Dragon: breaking gpu memory capacity limits with direct
nvm access[C]//SC18: International Conference for High
Performance Computing, Networking, Storage and
Analysis. IEEE, 2018: 414-426.
[92] WU K, REN J, LI D. Runtime data management on
non-volatile memory-based heterogeneous memory for
task-parallel programs[C]//SC18: International
Conference for High Performance Computing,
Networking, Storage and Analysis. IEEE, 2018: 401-413.
[93] 王嘉伦. 基于 GPU 的大规模数据分析查询的统一内存
管理和系统性能优化[D]. 上海: 华东师范大学, 2023.
WANG Jialun. Unified Memory Management and System
Performance Optimization for GPU-Based Large-Scale
Analytical Query Processing[D]. Shanghai, China: East China
Normal University, 2023.
[94] WANG J, PANG W, WENG C, et al. D-cubicle: boosting
data transfer dynamically for large-scale analytical
queries in single-gpu systems[J]. Frontiers of Computer
Science, 2023, 17(4): 174610.
[95] BAE J, LEE J, JIN Y, et al. FlashNeuron: SSD-Enabled
Large-Batch Training of Very Deep Neural
Networks[C]//19th USENIX Conference on File and
Storage Technologies (FAST 21). 2021: 387-401.
[96] CHOUKSE E, SULLIVAN M B, O’CONNOR M, et al.
Buddy compression: Enabling larger memory for deep
learning and hpc workloads on gpus[C]//2020
ACM/IEEE 47th Annual International Symposium on
Computer Architecture (ISCA). IEEE, 2020: 926-939.
[97] HAN S, POOL J, TRAN J, et al. Learning both weights
and connections for efficient neural network[J].
Advances in neural information processing systems, 2015, 28.
[98] JAIN A, PHANISHAYEE A, MARS J, et al. Gist:
Efficient data encoding for deep neural network
training[C]//2018 ACM/IEEE 45th Annual International
Symposium on Computer Architecture (ISCA). IEEE,
2018: 776-789.
[99] CHEN T, XU B, ZHANG C, et al. Training deep nets
with sublinear memory cost[J]. arXiv preprint
arXiv:1604.06174, 2016.
[100] GRUSLYS A, MUNOS R, DANIHELKA I, et al.
Memory-efficient backpropagation through time[J].
Advances in neural information processing systems,
2016, 29.
[101] PENG X, SHI X, DAI H, et al. Capuchin: Tensor-based
gpu memory management for deep
learning[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
891-905.
[102] AWAN A A, CHU C H, SUBRAMONI H, et al. Oc-dnn:
Exploiting advanced unified memory capabilities in cuda
9 and volta gpus for out-of-core dnn training[C]//2018
IEEE 25th International Conference on High
Performance Computing (HiPC). IEEE, 2018: 143-152.
[103] HILDEBRAND M, KHAN J, TRIKA S, et al. Autotm:
Automatic tensor movement in heterogeneous memory
systems using integer linear
programming[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
875-890.
[104] HUANG C C, JIN G, LI J. Swapadvisor: Pushing deep
learning beyond the gpu memory limit via smart
swapping[C]//Proceedings of the Twenty-Fifth
International Conference on Architectural Support for
Programming Languages and Operating Systems. 2020:
1341-1355.
[105] LE T D, IMAI H, NEGISHI Y, et al. Tflms: Large model
support in tensorflow by graph rewriting[J]. arXiv
preprint arXiv:1807.02037, 2018.
[106] RASLEY J, RAJBHANDARI S, RUWASE O, et al.
Deepspeed: System optimizations enable training deep
learning models with over 100 billion
parameters[C]//Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery &
Data Mining. 2020: 3505-3506.
[107] REN J, LUO J, WU K, et al. Sentinel: Efficient tensor
migration and allocation on heterogeneous memory
systems for deep learning[C]//2021 IEEE International
Symposium on High-Performance Computer
Architecture (HPCA). IEEE, 2021: 598-611.
[108] CHIEN S, PENG I, MARKIDIS S. Performance
evaluation of advanced features in cuda unified
memory[C]//2019 IEEE/ACM Workshop on Memory
Centric High Performance Computing (MCHPC). IEEE,
2019: 50-57.
[109] 王鹤澎, 王宏志, 李佳宁, 等. 面向新型处理器的数据密集
型计算[J]. 软件学报, 2016, 27(8): 2048-2067.
WANG H P, WANG H Z, LI J N, et al. New processor for
data-intensive computing[J]. Ruan Jian Xue Bao/Journal of
Software, 2016, 27(8): 2048-2067.