[1] DATTA K, MURPHY M, VOLKOV V, et al. Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures[C]//Proceedings of 2008 ACM/IEEE Conference on Supercomputing. Washington D.C., USA:IEEE Press, 2008:4.
[2] SHIMOKAWABE T, AOKI T, TAKAKI T, et al. Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer[C]//Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis. Washington D.C., USA:IEEE Press, 2011:3.
[3] WIENKE S, SPRINGER P, TERBOVEN C, et al. OpenACC-first experiences with real-world applications[C]//Proceedings of European Conference on Parallel Processing. Berlin, Germany:Springer, 2012:859-870.
[4] GROPP W, LUSK E, SKJELLUM A. Using MPI:portable parallel programming with the message-passing interface[M]. Cambridge, USA:MIT Press, 1994.
[5] GABRIEL E, FAGG G E, BOSILCA G, et al. Open MPI:goals, concept, and design of a next generation MPI implementation[C]//Proceedings of European Parallel Virtual Machine/Message Passing Interface Users' Group Meeting. Berlin, Germany:Springer, 2004:97-104.
[6] STELLNER G. CoCheck:checkpointing and process migration for MPI[C]//Proceedings of International Conference on Parallel Processing. Washington D.C., USA:IEEE Press, 1996:526-531.
[7] OH T, BEARD S R, JOHNSON N P, et al. A generalized framework for automatic scripting language parallelization[C]//Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques (PACT). Washington D.C., USA:IEEE Press, 2017:356-369.
[8] MORENO A, RODRÍGUEZ J J, BELTRÁN D, et al. Designing a benchmark for the performance evaluation of agent-based simulation applications on HPC[J]. The Journal of Supercomputing, 2019, 75(3):1524-1550.
[9] RODRIGUEZ M, BRUALLA L. Many-Integrated Core (MIC) technology for accelerating Monte Carlo simulation of radiation transport:a study based on the code DPM[J]. Computer Physics Communications, 2018, 225:28-35.
[10] KIRK D. NVIDIA CUDA software and GPU parallel computing architecture[C]//Proceedings of the 6th International Symposium on Memory Management. New York, USA:ACM Press, 2007:103-104.
[11] YANG Z Y, ZHU Y T, PU Y. Parallel image processing based on CUDA[C]//Proceedings of International Conference on Computer Science and Software Engineering. Washington D.C., USA:IEEE Press, 2008:198-201.
[12] SANDERS J, KANDROT E. CUDA by example:an introduction to general-purpose GPU programming[M]. Upper Saddle River, USA:Addison-Wesley, 2011.
[13] 林琳, 祝爱琦, 赵明璨, 等. 晶硅分子动力学模拟的GPU加速算法优化[J]. 计算机工程, 2023, 49(4):166-173. LIN L, ZHU A Q, ZHAO M C, et al. GPU-accelerated algorithm optimization for molecular dynamics simulation of crystalline silicon[J]. Computer Engineering, 2023, 49(4):166-173.(in Chinese)
[14] 韩彦岭, 沈思扬, 徐利军, 等. 面向深度学习图像分类的GPU并行方法研究[J]. 计算机工程, 2023, 49(1):191-200. HAN Y L, SHEN S Y, XU L J, et al. GPU parallel method for deep learning image classification[J]. Computer Engineering, 2023, 49(1):191-200.(in Chinese)
[15] CADENELLI N, JAKŠIĆ Z, POLO J, et al. Considerations in using OpenCL on GPUs and FPGAs for throughput-oriented genomics workloads[J]. Future Generation Computer Systems, 2019, 94:148-159.
[16] KUMAR J S, KUMAR G S, AHILAN A. High performance decoding aware FPGA bit-stream compression using RG codes[J]. Cluster Computing, 2019, 22(6):15007-15013.
[17] 李博, 黄东强, 贾金芳, 等. 基于CPU与GPU的异构模板计算优化研究[J]. 计算机工程, 2023, 49(4):131-137. LI B, HUANG D Q, JIA J F, et al. Research on optimization of heterogeneous stencil computing based on CPU and GPU[J]. Computer Engineering, 2023, 49(4):131-137.(in Chinese)
[18] HOLEWINSKI J, POUCHET L N, SADAYAPPAN P. High-performance code generation for stencil computations on GPU architectures[C]//Proceedings of the 26th ACM International Conference on Supercomputing. New York, USA:ACM Press, 2012:311-320.
[19] MATSUMURA K, ZOHOURI H R, WAHIB M, et al. AN5D:automated stencil framework for high-degree temporal blocking on GPUs[C]//Proceedings of the 18th ACM/IEEE International Symposium on Code Generation and Optimization. New York, USA:ACM Press, 2020:199-211.
[20] VERDOOLAEGE S, JUEGA J C, COHEN A, et al. Polyhedral parallel code generation for CUDA[J]. ACM Transactions on Architecture and Code Optimization, 2013, 9(4):54.
[21] NAWATA T, SUDA R. APTCC:auto parallelizing translator from C to CUDA[J]. Procedia Computer Science, 2011, 4:352-361.
[22] FONSECA A, CABRAL B, RAFAEL J, et al. Automatic parallelization:executing sequential programs on a task-based parallel runtime[J]. International Journal of Parallel Programming, 2016, 44(6):1337-1358.
[23] ZHANG Y Q, CAO T, LI S G, et al. Parallel processing systems for big data:a survey[J]. Proceedings of the IEEE, 2016, 104(11):2114-2136.
[24] HAGEDORN B, STOLTZFUS L, STEUWER M, et al. High performance stencil code generation with Lift[C]//Proceedings of 2018 International Symposium on Code Generation and Optimization. New York, USA:ACM Press, 2018:100-112.
[25] HUANG X M, HUANG X, WANG D, et al. OpenArray v1.0:a simple operator library for the decoupling of ocean modelling and parallel computing[EB/OL].[2023-07-11]. https://www.semanticscholar.org/paper/OpenArray-v1.0%3A-a-simple-operator-library-for-the-Huang-Huang/6e4833c2f6ceb2aeddd10acf185bce735b05aaf2.
[26] TURCHETTO M, PALU A D, VACONDIO R. A general design for a scalable MPI-GPU multi-resolution 2D numerical solver[J]. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(5):1036-1047.
[27] 李雁冰, 赵荣彩, 韩林, 等. 一种面向异构众核处理器的并行编译框架[J]. 软件学报, 2019, 30(4):981-1001. LI Y B, ZHAO R C, HAN L, et al. Parallelizing compilation framework for heterogeneous many-core processors[J]. Journal of Software, 2019, 30(4):981-1001.(in Chinese)
[28] PEARSON C, CHUNG I, XIONG J J, et al. Fast CUDA-aware MPI datatypes without platform support[EB/OL].[2023-07-11]. https://arxiv.org/abs/2012.14363v2.
[29] PEKKILÄ J, VÄISÄLÄ M S, KÄPYLÄ M J, et al. Scalable communication for high-order stencil computations using CUDA-aware MPI[J]. Parallel Computing, 2022, 111:102904.
[30] LI A, SONG S L, CHEN J Y, et al. Evaluating modern GPU interconnect:PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect[J]. IEEE Transactions on Parallel and Distributed Systems, 2020, 31(1):94-110.