33. LI Z F, PENG B C, WENG C L. XeFlow: streamlining inter-processor pipeline execution for the discrete CPU-GPU platform. IEEE Transactions on Computers, 2020, 69(6): 819-831.
doi: 10.1109/TC.2020.2968302

34. KWON Y, RHU M. A case for memory-centric HPC system architecture for training deep neural networks. IEEE Computer Architecture Letters, 2018, 17(2): 134-138.
doi: 10.1109/LCA.2018.2823302
36. RHU M, GIMELSHEIN N, CLEMONS J, et al. vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design[C]//Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Washington D.C., USA: IEEE Press, 2016: 1-13.

37. RHU M, O'CONNOR M, CHATTERJEE N, et al. Compressing DMA engine: leveraging activation sparsity for training deep neural networks[C]//Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2018: 78-91.

38. ZHENG T H, NELLANS D, ZULFIQAR A, et al. Towards high performance paged memory for GPUs[C]//Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2016: 345-357.
40. AGARWAL N, NELLANS D, STEPHENSON M, et al. Page placement strategies for GPUs within heterogeneous memory systems[C]//Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2015: 607-618.
53. JUNG J, KIM J, LEE J. DeepUM: tensor migration and prefetching in unified memory[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2023: 207-221.

54. GELADO I, STONE J E, CABEZAS J, et al. An asymmetric distributed shared memory model for heterogeneous parallel systems[C]//Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2010: 347-358.

55. JABLIN T B, PRABHU P, JABLIN J A, et al. Automatic CPU-GPU communication management and optimization[C]//Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, USA: ACM Press, 2011: 142-151.

56. JABLIN T B, JABLIN J A, PRABHU P, et al. Dynamically managed data for CPU-GPU architectures[C]//Proceedings of the 10th International Symposium on Code Generation and Optimization. New York, USA: ACM Press, 2012: 165-174.

57. PAI S, GOVINDARAJAN R, THAZHUTHAVEETIL M J. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme[C]//Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA: IEEE Press, 2012: 33-42.

58. ALSABER N, KULKARNI M. SemCache: semantics-aware caching for efficient GPU offloading[C]//Proceedings of the 27th International ACM Conference on Supercomputing. New York, USA: ACM Press, 2013: 421-432.

59. WANG L N, YE J M, ZHAO Y Y, et al. SuperNeurons: dynamic GPU memory management for training deep neural networks[C]//Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, USA: ACM Press, 2018: 41-53.
60. PEI W, LI Z H, PAN W. Survey of key technologies in GPU database system. Journal of Software, 2021, 32(3): 859-885. (in Chinese)

61. LI Z F. Accelerating data processing on the heterogeneous architecture[D]. Shanghai: East China Normal University, 2021. (in Chinese)
62. DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.
doi: 10.1145/1327452.1327492

63. HE B S, FANG W B, LUO Q, et al. Mars: a MapReduce framework on graphics processors[C]//Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA: IEEE Press, 2008: 260-269.
68. JOG A, KAYIRAN O, NACHIAPPAN N C, et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. ACM SIGPLAN Notices, 2013, 48(4): 395-406.
doi: 10.1145/2499368.2451158

69. JOHNSON T L, MERTEN M C, HWU W W. Run-time spatial locality detection and optimization[C]//Proceedings of the 30th Annual International Symposium on Microarchitecture. Washington D.C., USA: IEEE Press, 1997: 57-64.

70. JOG A, KAYIRAN O, MISHRA A K, et al. Orchestrated scheduling and prefetching for GPGPUs[C]//Proceedings of the 40th Annual International Symposium on Computer Architecture. New York, USA: ACM Press, 2013: 332-343.

71. GANGULY D, ZHANG Z Y, YANG J, et al. Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory[C]//Proceedings of the ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2019: 224-235.

72. GANGULY D, ZHANG Z Y, YANG J, et al. Adaptive page migration for irregular data-intensive applications under GPU memory oversubscription[C]//Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2020: 451-461.

73. GANGULY D, MELHEM R, YANG J. An adaptive framework for oversubscription management in CPU-GPU unified memory[C]//Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). Washington D.C., USA: IEEE Press, 2021: 1212-1217.

74. HILDRUM K, YU P S. Focused community discovery[C]//Proceedings of the 5th IEEE International Conference on Data Mining. Washington D.C., USA: IEEE Press, 2005: 4.

75. REN B, AGRAWAL G, LARUS J R, et al. SIMD parallelization of applications that traverse irregular data structures[C]//Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Washington D.C., USA: IEEE Press, 2013: 1-10.
79. LI C, AUSAVARUNGNIRUN R, ROSSBACH C J, et al. A framework for memory oversubscription management in graphics processing units[C]//Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2019: 49-63.
80. JIANG S, CHEN F, ZHANG X. CLOCK-Pro: an effective improvement of the clock replacement[C]//Proceedings of the USENIX Annual Technical Conference. Berkeley, USA: USENIX Association, 2005: 323-336.
81. JALEEL A, THEOBALD K B, STEELY S C, et al. High performance cache replacement using Re-Reference Interval Prediction (RRIP). ACM SIGARCH Computer Architecture News, 2010, 38(3): 60-71.
doi: 10.1145/1816038.1815971

82. YU Q, CHILDERS B, HUANG L B, et al. HPE: hierarchical page eviction policy for unified memory in GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(10): 2461-2474.
doi: 10.1109/TCAD.2019.2944790

83. CHE S, BOYER M, MENG J Y, et al. Rodinia: a benchmark suite for heterogeneous computing[C]//Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). Washington D.C., USA: IEEE Press, 2009: 44-54.
85. GRAUER-GRAY S, XU L F, SEARLES R, et al. Auto-tuning a high-level language targeted to GPU codes[C]//Proceedings of 2012 Innovative Parallel Computing (InPar). Washington D.C., USA: IEEE Press, 2012: 1-10.

86. YU Q, CHILDERS B, HUANG L B, et al. Coordinated page prefetch and eviction for memory oversubscription management in GPUs[C]//Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2020: 472-482.

87. KIM H, SIM J, GERA P, et al. Batch-aware unified memory management in GPUs for irregular workloads[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 1357-1370.

88. PARK D, KIM H, HAN H. Page reuse in cyclic thrashing of GPU under oversubscription: work-in-progress[C]//Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES). Washington D.C., USA: IEEE Press, 2020: 15-16.

89. LI L D, CHAPMAN B. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, USA: ACM Press, 2019: 1-16.

90. CHANG C H, KUMAR A, SIVASUBRAMANIAM A. To move or not to move? Page migration for irregular applications in over-subscribed GPU memory systems with DynaMap[C]//Proceedings of the 14th ACM International Conference on Systems and Storage. New York, USA: ACM Press, 2021: 1-12.

91. MARKTHUB P, BELVIRANLI M E, LEE S Y, et al. DRAGON: breaking GPU memory capacity limits with direct NVM access[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Washington D.C., USA: IEEE Press, 2018: 414-426.

92. WU K, REN J, LI D. Runtime data management on non-volatile memory-based heterogeneous memory for task-parallel programs[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Washington D.C., USA: IEEE Press, 2018: 401-413.
93. WANG J L. Unified memory management and system performance optimization for GPU-based large-scale analytical query processing[D]. Shanghai: East China Normal University, 2023. (in Chinese)
94. WANG J L, PANG W H, WENG C L, et al. D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Frontiers of Computer Science, 2023, 17(4): 174610.
doi: 10.1007/s11704-022-2160-z
95. BAE J, LEE J, JIN Y, et al. FlashNeuron: SSD-enabled large-batch training of very deep neural networks[C]//Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21). Berkeley, USA: USENIX Association, 2021: 387-401.
96. CHOUKSE E, SULLIVAN M B, O'CONNOR M, et al. Buddy compression: enabling larger memory for deep learning and HPC workloads on GPUs[C]//Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2020: 926-939.
98. JAIN A, PHANISHAYEE A, MARS J, et al. Gist: efficient data encoding for deep neural network training[C]//Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2018: 776-789.
101. PENG X, SHI X H, DAI H L, et al. Capuchin: tensor-based GPU memory management for deep learning[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 891-905.

102. AWAN A A, CHU C H, SUBRAMONI H, et al. OC-DNN: exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training[C]//Proceedings of the 25th International Conference on High Performance Computing (HiPC). Washington D.C., USA: IEEE Press, 2018: 143-152.

103. HILDEBRAND M, KHAN J, TRIKA S, et al. AutoTM: automatic tensor movement in heterogeneous memory systems using integer linear programming[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 875-890.

104. HUANG C C, JIN G, LI J Y. SwapAdvisor: pushing deep learning beyond the GPU memory limit via smart swapping[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 1341-1355.
106. RASLEY J, RAJBHANDARI S, RUWASE O, et al. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM Press, 2020: 3505-3506.

107. REN J, LUO J L, WU K, et al. Sentinel: efficient tensor migration and allocation on heterogeneous memory systems for deep learning[C]//Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2021: 598-611.

108. CHIEN S, PENG I, MARKIDIS S. Performance evaluation of advanced features in CUDA unified memory[C]//Proceedings of the IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). Washington D.C., USA: IEEE Press, 2019: 50-57.
109. WANG H P, WANG H Z, LI J N, et al. New processor for data-intensive computing. Journal of Software, 2016, 27(8): 2048-2067. (in Chinese)
doi: 10.13328/j.cnki.jos.005060