Computer Engineering ›› 2024, Vol. 50 ›› Issue (9): 216-225. doi: 10.19678/j.issn.1000-3428.0068183

• Advanced Computing and Data Processing •

Singular Value Decomposition Method Based on Domestic Heterogeneous Platforms

YANG Tailong1, ZHAO Hongpeng2, ZHANG Lei2,*

  1. School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
  2. Dawning Information Industry (Beijing) Co., Ltd., Beijing 100193, China
  • Received: 2023-08-04 Published: 2024-09-15 Online: 2024-01-26
  • Corresponding author: ZHANG Lei
  • Funding:
    National Key Research and Development Program of China (2021YFB0300200)

Abstract:

With the growth of compute-intensive applications such as deep learning, heterogeneous computing is becoming an important direction in parallel computing. Domestic heterogeneous platforms have developed rapidly in recent years, so developing algorithms and software tailored to their architectures is of great significance. Singular Value Decomposition (SVD), a powerful factorization for general matrices provided by linear algebra libraries, is used in many fields, including scientific computing, artificial intelligence, and signal processing. However, the SVD routines in the libraries currently available for one class of domestic accelerators perform far worse than those on NVIDIA platforms, which makes it difficult to port related applications efficiently. To address this, a matrix bidiagonalization method for domestic accelerators, mySVD, is proposed; it adjusts the algorithm flow to reduce thread launch and memory access overhead. Compute-intensive tasks are offloaded to the accelerator in a divide-and-conquer algorithm designed for domestic heterogeneous platforms, and a task-parallel method for generating the singular vector matrices is proposed that uses multiple streams across the CPU and the accelerator. Together these form an efficient porting and optimization scheme for the SVD algorithm. Experimental results show that, across the tested matrix sizes, the scheme achieves up to 9.8 times the performance of MKL, an existing commercial closed-source linear algebra library, and up to 5.5 times that of MAGMA, an existing open-source linear algebra library for heterogeneous computing. Finally, the scheme is applied to image processing and compared across platforms with MATLAB and with CUSOLVER, NVIDIA's GPU linear algebra library; it runs faster and produces images more similar to the original.

Key words: parallel computing, heterogeneous computing, Singular Value Decomposition (SVD), domestic platform, image processing
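
The abstract mentions an image-processing application but this page does not describe it in detail. As a rough illustration only, the sketch below shows the standard way SVD is used to compress a grayscale image: keep the k largest singular triplets and rebuild the image from them. It relies on NumPy's `np.linalg.svd` rather than mySVD, runs on a synthetic image, and uses PSNR as the similarity measure. These choices, along with the helper names `compress_rank_k`, `psnr`, and `make_test_image`, are assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def compress_rank_k(image: np.ndarray, k: int) -> np.ndarray:
    """Return the best rank-k approximation of a 2-D grayscale image."""
    # Thin SVD: image = U @ diag(s) @ Vt, singular values sorted descending.
    u, s, vt = np.linalg.svd(image, full_matrices=False)
    k = max(1, min(k, s.size))
    # Keep only the k largest singular triplets.
    return (u[:, :k] * s[:k]) @ vt[:k, :]


def psnr(original: np.ndarray, approx: np.ndarray, peak: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB (higher means closer to the original)."""
    mse = float(np.mean((original - approx) ** 2))
    return float("inf") if mse == 0.0 else float(10.0 * np.log10(peak**2 / mse))


def make_test_image(height: int = 128, width: int = 96, seed: int = 0) -> np.ndarray:
    """Synthetic grayscale image (hypothetical stand-in for a real photo)."""
    rng = np.random.default_rng(seed)
    y, x = np.mgrid[0:height, 0:width]
    smooth = 127.5 * (1.0 + np.sin(x / 9.0) * np.cos(y / 13.0))
    noisy = smooth + rng.normal(0.0, 4.0, size=(height, width))
    return np.clip(noisy, 0.0, 255.0)


if __name__ == "__main__":
    img = make_test_image()
    for k in (1, 5, 20, 96):
        approx = compress_rank_k(img, k)
        print(f"rank {k:3d}: PSNR = {psnr(img, approx):.2f} dB")
```

Raising k trades storage for fidelity: a rank-k approximation of an m×n image needs k(m + n + 1) numbers instead of mn, and its Frobenius-norm error equals the root sum of squares of the discarded singular values.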