33. LI Z F, PENG B C, WENG C L. XeFlow: streamlining inter-processor pipeline execution for the discrete CPU-GPU platform. IEEE Transactions on Computers, 2020, 69(6): 819-831.
doi: 10.1109/TC.2020.2968302

34. KWON Y, RHU M. A case for memory-centric HPC system architecture for training deep neural networks. IEEE Computer Architecture Letters, 2018, 17(2): 134-138.
doi: 10.1109/LCA.2018.2823302
36. RHU M, GIMELSHEIN N, CLEMONS J, et al. vDNN: virtualized deep neural networks for scalable, memory-efficient neural network design[C]//Proceedings of the 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO). Washington D.C., USA: IEEE Press, 2016: 1-13.

37. RHU M, O'CONNOR M, CHATTERJEE N, et al. Compressing DMA engine: leveraging activation sparsity for training deep neural networks[C]//Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2018: 78-91.

38. ZHENG T H, NELLANS D, ZULFIQAR A, et al. Towards high performance paged memory for GPUs[C]//Proceedings of the IEEE International Symposium on High Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2016: 345-357.
40. AGARWAL N, NELLANS D, STEPHENSON M, et al. Page placement strategies for GPUs within heterogeneous memory systems[C]//Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2015: 607-618.
53. JUNG J, KIM J, LEE J. DeepUM: tensor migration and prefetching in unified memory[C]//Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2023: 207-221.

54. GELADO I, STONE J E, CABEZAS J, et al. An asymmetric distributed shared memory model for heterogeneous parallel systems[C]//Proceedings of the 15th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2010: 347-358.

55. JABLIN T B, PRABHU P, JABLIN J A, et al. Automatic CPU-GPU communication management and optimization[C]//Proceedings of the 32nd ACM SIGPLAN Conference on Programming Language Design and Implementation. New York, USA: ACM Press, 2011: 142-151.

56. JABLIN T B, JABLIN J A, PRABHU P, et al. Dynamically managed data for CPU-GPU architectures[C]//Proceedings of the 10th International Symposium on Code Generation and Optimization. New York, USA: ACM Press, 2012: 165-174.

57. PAI S, GOVINDARAJAN R, THAZHUTHAVEETIL M J. Fast and efficient automatic memory management for GPUs using compiler-assisted runtime coherence scheme[C]//Proceedings of the 21st International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA: IEEE Press, 2012: 33-42.

58. ALSABER N, KULKARNI M. SemCache: semantics-aware caching for efficient GPU offloading[C]//Proceedings of the 27th International ACM Conference on Supercomputing. New York, USA: ACM Press, 2013: 421-432.

59. WANG L N, YE J M, ZHAO Y Y, et al. SuperNeurons: dynamic GPU memory management for training deep neural networks[C]//Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. New York, USA: ACM Press, 2018: 41-53.
60. PEI W, LI Z H, PAN W. Survey of key technologies in GPU database system. Journal of Software, 2021, 32(3): 859-885. (in Chinese)

61. LI Z F. Accelerating data processing on the heterogeneous architecture[D]. Shanghai: East China Normal University, 2021. (in Chinese)
62. DEAN J, GHEMAWAT S. MapReduce: simplified data processing on large clusters. Communications of the ACM, 2008, 51(1): 107-113.
doi: 10.1145/1327452.1327492

63. HE B S, FANG W B, LUO Q, et al. Mars: a MapReduce framework on graphics processors[C]//Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA: IEEE Press, 2008: 260-269.
68. JOG A, KAYIRAN O, NACHIAPPAN N C, et al. OWL: cooperative thread array aware scheduling techniques for improving GPGPU performance. ACM SIGPLAN Notices, 2013, 48(4): 395-406.
doi: 10.1145/2499368.2451158

69. JOHNSON T L, MERTEN M C, HWU W W. Run-time spatial locality detection and optimization[C]//Proceedings of the 30th Annual International Symposium on Microarchitecture. Washington D.C., USA: IEEE Press, 1997: 57-64.

70. JOG A, KAYIRAN O, MISHRA A K, et al. Orchestrated scheduling and prefetching for GPGPUs[C]//Proceedings of the 40th Annual International Symposium on Computer Architecture. New York, USA: ACM Press, 2013: 332-343.

71. GANGULY D, ZHANG Z Y, YANG J, et al. Interplay between hardware prefetcher and page eviction policy in CPU-GPU unified virtual memory[C]//Proceedings of the ACM/IEEE 46th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2019: 224-235.

72. GANGULY D, ZHANG Z Y, YANG J, et al. Adaptive page migration for irregular data-intensive applications under GPU memory oversubscription[C]//Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2020: 451-461.

73. GANGULY D, MELHEM R, YANG J. An adaptive framework for oversubscription management in CPU-GPU unified memory[C]//Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE). Washington D.C., USA: IEEE Press, 2021: 1212-1217.

74. HILDRUM K, YU P S. Focused community discovery[C]//Proceedings of the 5th IEEE International Conference on Data Mining. Washington D.C., USA: IEEE Press, 2005: 4.

75. REN B, AGRAWAL G, LARUS J R, et al. SIMD parallelization of applications that traverse irregular data structures[C]//Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization (CGO). Washington D.C., USA: IEEE Press, 2013: 1-10.
79. LI C, AUSAVARUNGNIRUN R, ROSSBACH C J, et al. A framework for memory oversubscription management in graphics processing units[C]//Proceedings of the 24th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2019: 49-63.
80. JIANG S, CHEN F, ZHANG X. CLOCK-Pro: an effective improvement of the clock replacement[C]//Proceedings of the USENIX Annual Technical Conference. Berkeley, USA: USENIX Association, 2005: 323-336.
81. JALEEL A, THEOBALD K B, STEELY S C, et al. High performance cache replacement using Re-Reference Interval Prediction (RRIP). ACM SIGARCH Computer Architecture News, 2010, 38(3): 60-71.
doi: 10.1145/1816038.1815971

82. YU Q, CHILDERS B, HUANG L B, et al. HPE: hierarchical page eviction policy for unified memory in GPUs. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 2020, 39(10): 2461-2474.
doi: 10.1109/TCAD.2019.2944790

83. CHE S, BOYER M, MENG J Y, et al. Rodinia: a benchmark suite for heterogeneous computing[C]//Proceedings of the IEEE International Symposium on Workload Characterization (IISWC). Washington D.C., USA: IEEE Press, 2009: 44-54.
85. GRAUER-GRAY S, XU L F, SEARLES R, et al. Auto-tuning a high-level language targeted to GPU codes[C]//Proceedings of 2012 Innovative Parallel Computing (InPar). Washington D.C., USA: IEEE Press, 2012: 1-10.

86. YU Q, CHILDERS B, HUANG L B, et al. Coordinated page prefetch and eviction for memory oversubscription management in GPUs[C]//Proceedings of the IEEE International Parallel and Distributed Processing Symposium (IPDPS). Washington D.C., USA: IEEE Press, 2020: 472-482.

87. KIM H, SIM J, GERA P, et al. Batch-aware unified memory management in GPUs for irregular workloads[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 1357-1370.

88. PARK D, KIM H, HAN H. Page reuse in cyclic thrashing of GPU under oversubscription: work-in-progress[C]//Proceedings of the International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES). Washington D.C., USA: IEEE Press, 2020: 15-16.

89. LI L D, CHAPMAN B. Compiler assisted hybrid implicit and explicit GPU memory management under unified address space[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York, USA: ACM Press, 2019: 1-16.

90. CHANG C H, KUMAR A, SIVASUBRAMANIAM A. To move or not to move? Page migration for irregular applications in over-subscribed GPU memory systems with DynaMap[C]//Proceedings of the 14th ACM International Conference on Systems and Storage. New York, USA: ACM Press, 2021: 1-12.

91. MARKTHUB P, BELVIRANLI M E, LEE S Y, et al. DRAGON: breaking GPU memory capacity limits with direct NVM access[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Washington D.C., USA: IEEE Press, 2018: 414-426.

92. WU K, REN J, LI D. Runtime data management on non-volatile memory-based heterogeneous memory for task-parallel programs[C]//Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. Washington D.C., USA: IEEE Press, 2018: 401-413.
93. WANG J L. Unified memory management and system performance optimization for GPU-based large-scale analytical query processing[D]. Shanghai: East China Normal University, 2023. (in Chinese)
94. WANG J L, PANG W H, WENG C L, et al. D-Cubicle: boosting data transfer dynamically for large-scale analytical queries in single-GPU systems. Frontiers of Computer Science, 2023, 17(4): 174610.
doi: 10.1007/s11704-022-2160-z
95. BAE J, LEE J, JIN Y, et al. FlashNeuron: SSD-enabled large-batch training of very deep neural networks[C]//Proceedings of the 19th USENIX Conference on File and Storage Technologies (FAST 21). Berkeley, USA: USENIX Association, 2021: 387-401.
96. CHOUKSE E, SULLIVAN M B, O'CONNOR M, et al. Buddy compression: enabling larger memory for deep learning and HPC workloads on GPUs[C]//Proceedings of the ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2020: 926-939.
98. JAIN A, PHANISHAYEE A, MARS J, et al. Gist: efficient data encoding for deep neural network training[C]//Proceedings of the ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA). Washington D.C., USA: IEEE Press, 2018: 776-789.
101. PENG X, SHI X H, DAI H L, et al. Capuchin: tensor-based GPU memory management for deep learning[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 891-905.

102. AWAN A A, CHU C H, SUBRAMONI H, et al. OC-DNN: exploiting advanced unified memory capabilities in CUDA 9 and Volta GPUs for out-of-core DNN training[C]//Proceedings of the 25th International Conference on High Performance Computing (HiPC). Washington D.C., USA: IEEE Press, 2018: 143-152.

103. HILDEBRAND M, KHAN J, TRIKA S, et al. AutoTM: automatic tensor movement in heterogeneous memory systems using integer linear programming[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 875-890.

104. HUANG C C, JIN G, LI J Y. SwapAdvisor: pushing deep learning beyond the GPU memory limit via smart swapping[C]//Proceedings of the 25th International Conference on Architectural Support for Programming Languages and Operating Systems. New York, USA: ACM Press, 2020: 1341-1355.
106. RASLEY J, RAJBHANDARI S, RUWASE O, et al. DeepSpeed: system optimizations enable training deep learning models with over 100 billion parameters[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. New York, USA: ACM Press, 2020: 3505-3506.

107. REN J, LUO J L, WU K, et al. Sentinel: efficient tensor migration and allocation on heterogeneous memory systems for deep learning[C]//Proceedings of the IEEE International Symposium on High-Performance Computer Architecture (HPCA). Washington D.C., USA: IEEE Press, 2021: 598-611.

108. CHIEN S, PENG I, MARKIDIS S. Performance evaluation of advanced features in CUDA unified memory[C]//Proceedings of the IEEE/ACM Workshop on Memory Centric High Performance Computing (MCHPC). Washington D.C., USA: IEEE Press, 2019: 50-57.
109. WANG H P, WANG H Z, LI J N, et al. New processor for data-intensive computing. Journal of Software, 2016, 27(8): 2048-2067. (in Chinese)
doi: 10.13328/j.cnki.jos.005060