
Computer Engineering ›› 2023, Vol. 49 ›› Issue (4): 131-137. doi: 10.19678/j.issn.1000-3428.0064282

• Advanced Computing and Data Processing •

基于CPU与GPU的异构模板计算优化研究

李博1, 黄东强1, 贾金芳1, 吴利1, 王晓英1, 黄建强1,2   

  1. 青海大学 计算机技术与应用系, 西宁 810016;
    2. 清华大学 计算机科学与技术系, 北京 100084
  • 收稿日期:2022-03-23 修回日期:2022-05-05 发布日期:2022-06-20
  • About the authors: LI Bo (born 1998), male, master's student, whose main research interest is high-performance computing; HUANG Dongqiang, master's student; JIA Jinfang and WU Li, lecturers with master's degrees; WANG Xiaoying, professor; HUANG Jianqiang (corresponding author), professor and doctoral supervisor.
  • Funding:
    Applied Basic Research Project of the Qinghai Provincial Department of Science and Technology (2022-ZJ-701); National Natural Science Foundation of China (62062059, 62162053); Qinghai Province "Kunlun Talents: High-end Innovation and Entrepreneurship Talents" Program; Ministry of Education "Chunhui Program" Cooperative Research Project (QDCH2018001); 2021 Graduate Course Construction Project of Qinghai University (qdyk-210413); 2021 Youth Research Fund Project of Qinghai University (2021-QGY-13); Qinghai Province Backbone Teachers Program; industry-commissioned project of the Tsinghua University-Ningxia Yinchuan Joint Research Institute for Internet of Waters on Digital Water Governance (SKL-IOW-2020TC2004-01).

Research on Optimization of Heterogeneous Stencil Computing Based on CPU and GPU

LI Bo1, HUANG Dongqiang1, JIA Jinfang1, WU Li1, WANG Xiaoying1, HUANG Jianqiang1,2   

  1. Department of Computer Technology and Applications, Qinghai University, Xining 810016, China;
    2. Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2022-03-23  Revised: 2022-05-05  Published: 2022-06-20

摘要: 模板计算是一类使用固定模板的算法,被广泛应用于图像处理、计算流体动力学模拟等领域,现有的模板计算存在计算并行度弱、缓存命中率低、无法充分利用计算资源等问题。在消息传递接口(MPI)计算模型和跨平台多线程(OpenMP)计算模型的基础上提出MPI+OpenMP、统一计算设备架构(CUDA)+OpenMP两种混合计算模型。相较于常规的MPI计算模型,MPI+OpenMP计算模型通过使用MPI进行多节点之间的粗粒度通信,使用OpenMP实现进程内部的细粒度并行计算,并结合单指令多数据、非一致内存访问、数据预取、数据分块等技术,提高模板计算过程中的缓存命中率与计算并行能力,加快计算速度。在只采用CUDA进行模板计算时,CPU的计算资源没有得到充分利用,浪费了大量计算资源,CUDA+OpenMP计算模型通过对计算任务的负载划分让CPU也参与到计算中,以减少通信开销及充分利用CPU的多核并行计算能力。实验结果表明,OpenMP+MPI计算模型相较于MPI计算模型的平均加速比为3.67,CUDA+OpenMP计算模型相较于CUDA计算模型的平均加速比为1.26,OpenMP+MPI和CUDA+OpenMP两种计算模型的性能均得到了显著提升。

关键词: 模板计算, 消息传递接口, 跨平台多线程, 单指令多数据, 非一致内存访问, 统一计算设备架构

Abstract: Stencil computing, a class of algorithms that apply a fixed stencil pattern, is widely used in image processing, computational fluid dynamics simulation, and other fields. However, existing stencil computing approaches suffer from weak computational parallelism, low cache hit rates, and insufficient utilization of computing resources. Two hybrid computing models, MPI+OpenMP and Compute Unified Device Architecture (CUDA)+OpenMP, are proposed on the basis of the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) computing models. Unlike the conventional MPI model, the MPI+OpenMP model uses MPI for coarse-grained communication between nodes and OpenMP for fine-grained parallel computing within each process, and it combines Single Instruction Multiple Data (SIMD), Non-Uniform Memory Access (NUMA) optimization, data prefetching, and data blocking to raise the cache hit rate and the degree of parallelism of stencil computing, thereby accelerating it. When CUDA alone is used for stencil computation, the CPU's computing resources are left largely idle; the CUDA+OpenMP model therefore partitions the computational load so that the CPU also participates in the computation, reducing communication overhead and exploiting the CPU's multi-core parallelism. Experimental results show that the average speedup of the MPI+OpenMP model over the MPI model is 3.67, and that of the CUDA+OpenMP model over the CUDA model is 1.26; both hybrid models thus deliver significant performance improvements.
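
The abstract describes the MPI+OpenMP design only at a high level, so the following minimal sketch is added to make the division of labour concrete. It is an illustrative 2-D five-point Jacobi stencil, not the authors' implementation: MPI performs the coarse-grained halo exchange between processes (nodes), while OpenMP threads with SIMD vectorization provide the fine-grained parallelism inside each process. The grid size, 1-D row decomposition, and iteration count are assumptions, and the NUMA-aware placement, data-prefetching, and blocking optimizations mentioned in the abstract are omitted for brevity.

```
// A minimal sketch of the MPI+OpenMP hybrid model described in the abstract,
// not the authors' code: a 2-D five-point Jacobi stencil with an assumed 1-D
// row decomposition. MPI handles the coarse-grained halo exchange between
// processes (nodes); OpenMP threads and SIMD handle the fine-grained sweep
// inside each process. Sizes and iteration count are illustrative only.
#include <mpi.h>
#include <omp.h>
#include <cstdio>
#include <vector>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, nprocs;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    const int N = 1024;              // global rows, assumed divisible by nprocs
    const int M = 1024;              // columns
    const int rows = N / nprocs;     // rows owned by this process
    const int STEPS = 100;

    // Local slab with one ghost row above and one below.
    std::vector<double> cur((rows + 2) * M, 0.0), next((rows + 2) * M, 0.0);
    auto at = [M](int i, int j) { return i * M + j; };

    const int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
    const int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

    for (int step = 0; step < STEPS; ++step) {
        // Coarse-grained communication: exchange boundary rows with neighbours.
        MPI_Sendrecv(&cur[at(1, 0)],        M, MPI_DOUBLE, up,   0,
                     &cur[at(rows + 1, 0)], M, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[at(rows, 0)],     M, MPI_DOUBLE, down, 1,
                     &cur[at(0, 0)],        M, MPI_DOUBLE, up,   1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);

        // Fine-grained parallelism inside the process: OpenMP threads sweep
        // contiguous blocks of rows, and the inner loop is vectorized (SIMD).
        #pragma omp parallel for schedule(static)
        for (int i = 1; i <= rows; ++i) {
            #pragma omp simd
            for (int j = 1; j < M - 1; ++j) {
                next[at(i, j)] = 0.25 * (cur[at(i - 1, j)] + cur[at(i + 1, j)] +
                                         cur[at(i, j - 1)] + cur[at(i, j + 1)]);
            }
        }
        cur.swap(next);
    }

    if (rank == 0) std::printf("stencil sweep finished\n");
    MPI_Finalize();
    return 0;
}
```

A production version would additionally pin threads to NUMA domains, rely on first-touch initialization, and overlap the halo exchange with interior computation; those details are left out here to keep the sketch short.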

Key words: stencil computing, Message Passing Interface (MPI), Open Multi-Processing (OpenMP), Single Instruction Multiple Data (SIMD), Non-Uniform Memory Access (NUMA), Compute Unified Device Architecture (CUDA)
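
To complement the sketch above, the CPU/GPU load division of the CUDA+OpenMP model summarized in the abstract can be illustrated as follows. This is again only a hedged sketch rather than the authors' implementation: the five-point Jacobi kernel, the 4096×4096 grid, and the 80/20 row split between GPU and CPU are assumptions chosen for illustration.

```
// A minimal sketch of the CUDA+OpenMP load split, not the paper's code: one
// Jacobi sweep over an N x M grid is partitioned by rows so that a CUDA
// kernel updates the upper part while OpenMP threads on the CPU update the
// rest, keeping the CPU cores busy. The grid size and the 80/20 split ratio
// are assumptions used only for illustration.
#include <cuda_runtime.h>
#include <omp.h>
#include <cstdio>
#include <vector>

#define IDX(i, j, m) ((i) * (m) + (j))

__global__ void jacobi_gpu(const double *in, double *out, int rows, int m) {
    int i = blockIdx.y * blockDim.y + threadIdx.y;
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= 1 && i < rows - 1 && j >= 1 && j < m - 1)
        out[IDX(i, j, m)] = 0.25 * (in[IDX(i - 1, j, m)] + in[IDX(i + 1, j, m)] +
                                    in[IDX(i, j - 1, m)] + in[IDX(i, j + 1, m)]);
}

int main() {
    const int N = 4096, M = 4096;
    const int gpu_rows = N * 8 / 10;   // assumed load split: ~80% GPU, ~20% CPU
    std::vector<double> in(N * M, 1.0), out(N * M, 0.0);

    // GPU part: copy its row slice (plus one halo row) and launch the kernel.
    double *d_in, *d_out;
    size_t slice = (size_t)(gpu_rows + 1) * M * sizeof(double);
    cudaMalloc(&d_in, slice);
    cudaMalloc(&d_out, slice);
    cudaMemcpy(d_in, in.data(), slice, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out.data(), slice, cudaMemcpyHostToDevice);
    dim3 block(32, 8);
    dim3 grid((M + block.x - 1) / block.x, (gpu_rows + block.y) / block.y);
    jacobi_gpu<<<grid, block>>>(d_in, d_out, gpu_rows + 1, M);  // asynchronous

    // CPU part overlaps with the kernel: OpenMP threads update the lower rows.
    #pragma omp parallel for schedule(static)
    for (int i = gpu_rows; i < N - 1; ++i)
        for (int j = 1; j < M - 1; ++j)
            out[IDX(i, j, M)] = 0.25 * (in[IDX(i - 1, j, M)] + in[IDX(i + 1, j, M)] +
                                        in[IDX(i, j - 1, M)] + in[IDX(i, j + 1, M)]);

    // This copy waits for the kernel to finish and gathers the GPU-computed
    // rows 1 .. gpu_rows-1 back into the host array.
    cudaMemcpy(out.data(), d_out, (size_t)gpu_rows * M * sizeof(double),
               cudaMemcpyDeviceToHost);
    std::printf("sample value: %f\n", out[IDX(1, 1, M)]);
    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

In a real implementation the split ratio would be tuned, or measured at run time, so that the CPU and GPU portions finish at roughly the same time; it is fixed here only to keep the example short.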

CLC number: