作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2021, Vol. 47 ›› Issue (8): 131-139. doi: 10.19678/j.issn.1000-3428.0058071

• 先进计算与数据处理 • 上一篇    下一篇

面向异构架构的传递闭包并行算法

肖汉1,3, 郭宝云2, 李彩林2, 周清雷3   

  1. 1. 郑州师范学院 信息科学与技术学院, 郑州 450044;
    2. 山东理工大学 建筑工程学院, 山东 淄博 255000;
    3. 郑州大学 信息工程学院, 郑州 450001
  • 收稿日期:2020-04-15 修回日期:2020-07-16 发布日期:2021-08-14
  • 作者简介:肖汉(1970-),男,教授、博士后,主研方向为大规模并行算法、遥感大数据并行处理;郭宝云(通信作者),讲师、博士;李彩林,副教授、博士;周清雷,教授、博士、博士生导师。
  • 基金资助:
    国家自然科学基金(41601496,41701525,61572444);山东省自然科学基金(ZR2017LD002);山东省重点研发计划项目(2018GGX106002)。

Parallel Transitive Closure Algorithm for Heterogeneous Architecture

XIAO Han1,3, GUO Baoyun2, LI Cailin2, ZHOU Qinglei3   

  1. 1. School of Information Science and Technology, Zhengzhou Normal University, Zhengzhou 450044, China;
    2. School of Civil and Architectural Engineering, Shandong University of Technology, Zibo, Shandong 255000, China;
    3. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Received:2020-04-15 Revised:2020-07-16 Published:2021-08-14

摘要: 传统求图传递闭包的方法存在计算量大与计算时间长的问题。为加快处理大数据量的传递闭包算法的计算速度,结合算法密集计算和开放式计算语言(OpenCL)框架的特征,采用本地存储器优化的并行子矩阵乘和分块的矩阵乘并行计算,提出一种基于OpenCL的传递闭包并行算法。利用本地存储器优化的并行子矩阵乘算法来优化计算步骤,提高图形处理器(GPU)的存储器利用率,降低数据获取延迟。通过分块矩阵乘并行计算算法实现大数据量的矩阵乘,提高GPU计算核心的利用率。数据结果表明,与CPU串行算法、基于开放多处理的并行算法和基于统一设备计算架构的并行算法相比,传递闭包并行算法在OpenCL架构下NVIDIA GeForce GTX 1070计算平台上分别获得了593.14倍、208.62倍和1.05倍的加速比。

关键词: 矩阵乘, 传递闭包, 图形处理器, 开放式计算语言, 并行算法

Abstract: The traditional method for obtaining the transitive closure of the graphs faces the large amount of calculation and long calculation time. In order to improve the computing speed of the transitive closure algorithm for dealing with large amounts of data, an Open Computing Language(OpenCL)-based parallel algorithm for transitive closure is proposed. The algorithm combines the characteristics of algorithm-intensive computation and OpenCL architecture, and uses the parallel submatrix multiplication and block matrix multiplication optimized by local memory for parallel computing. The parallel submatrix multiplication algorithm is used to optimize the computational steps, improves the memory utilization of the Graphic Processing Unit(GPU), and reduces the data acquisition delay. The parallel block matrix multiplication algorithm is used to implement matrix multiplication involving large amounts of data, and improve the utilization of the GPU computing cores. The experimental results show that compared with the sequential CPU-based algorithm, parallel algorithm based on Open Multi-Processing, and parallel algorithm based on Compute Unified Device Architecture(CUDA), the proposed parallel transitive closure algorithm provides a 593.14 times, 208.62 times and 1.05 times speedup respectively on the NVIDIA GeForce GTX 1070 computing platform with OpenCL architecture.

Key words: matrix multiplication, transitive closure, Graphic Processing Unit(GPU), Open Computing Language (OpenCL), parallel algorithm

中图分类号: