Research on Optimization of Heterogeneous Stencil Computing Based on CPU and GPU

doi:10.19678/j.issn.1000-3428.0064282

Computer Engineering, 2023, Vol. 49, Issue (4): 131-137.

LI Bo¹, HUANG Dongqiang¹, JIA Jinfang¹, WU Li¹, WANG Xiaoying¹, HUANG Jianqiang¹,²

1. Department of Computer Technology and Applications, Qinghai University, Xining 810016, China;
2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China

Received: 2022-03-23 Revised: 2022-05-05 Published: 2022-06-20
About the authors: LI Bo (born 1998), male, master's student, whose main research interest is high-performance computing; HUANG Dongqiang, master's student; JIA Jinfang and WU Li, lecturers with master's degrees; WANG Xiaoying, professor; HUANG Jianqiang (corresponding author), professor and doctoral supervisor.
Funding:
Applied Basic Research Project of the Qinghai Provincial Department of Science and Technology (2022-ZJ-701); National Natural Science Foundation of China (62062059, 62162053); Qinghai Province "Kunlun Talents: High-end Innovation and Entrepreneurship Talents" Program; "Chunhui Plan" Cooperative Research Project of the Ministry of Education (QDCH2018001); Qinghai University 2021 Graduate Course Construction Project (qdyk-210413); Qinghai University 2021 Youth Research Fund Project (2021-QGY-13); Qinghai Province Backbone Teacher Project; commissioned research project of the Tsinghua University-Ningxia Yinchuan Joint Research Institute for Internet of Waters on Digital Water Governance (SKL-IOW-2020TC2004-01).

Abstract

Stencil computing is a class of algorithms that apply a fixed pattern (stencil) and is widely used in image processing, computational fluid dynamics simulation, and other fields. Existing stencil computing approaches suffer from weak computational parallelism, low cache hit rates, and insufficient utilization of computing resources. Building on the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) computing models, two hybrid computing models are proposed: MPI+OpenMP and Compute Unified Device Architecture (CUDA)+OpenMP. Compared with the conventional MPI computing model, the MPI+OpenMP model uses MPI for coarse-grained communication between nodes and OpenMP for fine-grained parallel computing within each process. It also combines Single Instruction Multiple Data (SIMD), Non-Uniform Memory Access (NUMA), data prefetching, data blocking, and other techniques to improve the cache hit rate and parallelism of stencil computing, thereby accelerating it. When only CUDA is used for stencil computing, the CPU's computing resources are not fully utilized and a large amount of computing capacity is wasted. The CUDA+OpenMP model partitions the computational load so that the CPU also takes part in the computation, which reduces communication overhead and makes full use of the CPU's multi-core parallel computing capability. Experimental results show that the MPI+OpenMP model achieves an average speedup of 3.67 over the MPI model, and the CUDA+OpenMP model achieves an average speedup of 1.26 over the CUDA model; both hybrid models deliver significant performance improvements.

Key words: stencil computing, Message Passing Interface (MPI), Open Multi-Processing (OpenMP), Single Instruction Multiple Data (SIMD), Non-Uniform Memory Access (NUMA), Compute Unified Device Architecture (CUDA)
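
The fine-grained, in-process part of the MPI+OpenMP model described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the 5-point Jacobi stencil, the function name stencil_sweep, and the tile size TILE are all hypothetical choices made for the example. MPI halo exchange between nodes, NUMA-aware data placement, and explicit prefetching are left out.

```c
#include <stddef.h>

/* Tile edge length in grid points. Hypothetical value; tune it to the cache size. */
#define TILE 64

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/*
 * One Jacobi sweep of a 5-point stencil over an n x n row-major grid.
 * Each interior point of `out` becomes the average of its four neighbours
 * in `in`; boundary points are copied unchanged.
 */
void stencil_sweep(const double *restrict in, double *restrict out, size_t n)
{
    /* Copy the top and bottom rows, then the left and right columns. */
    for (size_t j = 0; j < n; ++j) {
        out[j] = in[j];
        out[(n - 1) * n + j] = in[(n - 1) * n + j];
    }
    for (size_t i = 0; i < n; ++i) {
        out[i * n] = in[i * n];
        out[i * n + n - 1] = in[i * n + n - 1];
    }
    if (n < 3) {
        return; /* no interior points */
    }

    /* Data blocking: the interior is cut into TILE x TILE tiles so that the
       rows touched by one tile stay in cache. Tiles are independent, so the
       OpenMP threads of this process share them. */
    #pragma omp parallel for collapse(2) schedule(static)
    for (size_t ii = 1; ii < n - 1; ii += TILE) {
        for (size_t jj = 1; jj < n - 1; jj += TILE) {
            size_t i_end = min_sz(ii + TILE, n - 1);
            size_t j_end = min_sz(jj + TILE, n - 1);
            for (size_t i = ii; i < i_end; ++i) {
                /* SIMD: ask the compiler to vectorize the unit-stride loop. */
                #pragma omp simd
                for (size_t j = jj; j < j_end; ++j) {
                    out[i * n + j] = 0.25 * (in[(i - 1) * n + j] +
                                             in[(i + 1) * n + j] +
                                             in[i * n + j - 1] +
                                             in[i * n + j + 1]);
                }
            }
        }
    }
}
```

Built with an OpenMP-enabled compiler (for example gcc -O2 -fopenmp), the tiles are distributed across threads; without OpenMP support the pragmas are ignored and the same code runs serially. In a hybrid setting, each MPI process would call a routine like this on its own subdomain after exchanging boundary rows with its neighbours.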

CLC number: TP393

LI Bo, HUANG Dongqiang, JIA Jinfang, WU Li, WANG Xiaoying, HUANG Jianqiang. Research on Optimization of Heterogeneous Stencil Computing Based on CPU and GPU[J]. Computer Engineering, 2023, 49(4): 131-137.

https://www.ecice06.com/CN/Y2023/V49/I4/131

Figures/Tables: 11



