Singular Value Decomposition Method Based on Domestic Heterogeneous Platforms

doi:10.19678/j.issn.1000-3428.0068183

Abstract:

With the growth of compute-intensive applications such as deep learning, heterogeneous computing is becoming an important direction in parallel computing. Domestic heterogeneous platforms have advanced rapidly in recent years, so it is important to develop algorithms and software tailored to their architectures. Singular Value Decomposition (SVD) is a powerful factorization for general matrices in linear algebra libraries. It is used in many fields, including scientific computing, artificial intelligence, and signal processing. However, the SVD routines in the libraries currently available for one class of domestic accelerators perform far below their NVIDIA counterparts. This hinders the efficient porting of related applications.

To address this, mySVD, a matrix bidiagonalization method for domestic accelerators, is proposed. It adjusts the algorithmic flow to reduce thread-launch and memory-access overhead. Compute-intensive tasks are offloaded to the accelerator in a divide-and-conquer algorithm designed for domestic heterogeneous platforms. In addition, a task-parallel method for generating the singular vector matrices is proposed that uses multiple streams across the CPU and the accelerator. Together, these form an efficient porting and optimization scheme for the SVD.

Experimental results show that, across the tested matrix sizes, the scheme achieves up to 9.8 times the performance of the commercial closed-source linear algebra library MKL. It also achieves up to 5.5 times that of the open-source heterogeneous linear algebra library MAGMA. Finally, the method is applied to image processing and compared across platforms with MATLAB and with NVIDIA's GPU linear algebra library cuSOLVER. It runs faster than both and produces images with higher similarity to the original.
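Bidiagonalization (Fig.1) is the first stage of the standard SVD pipeline. Orthogonal reflections reduce A to an upper bidiagonal B = UᵀAV, and B has the same singular values as A.

As a point of reference only, the following is a minimal sequential sketch of classical Householder (Golub-Kahan) bidiagonalization in pure Python. It is not the paper's mySVD kernel, whose accelerator-specific changes to thread launches and memory access are not described here. The function names and the test matrix are hypothetical.

```python
import math

def householder(x):
    """Return (v, beta) such that (I - beta v v^T) x = alpha e_1 with |alpha| = ||x||."""
    norm_x = math.sqrt(sum(t * t for t in x))
    if norm_x == 0.0:
        return [0.0] * len(x), 0.0                    # nothing to annihilate
    alpha = -norm_x if x[0] >= 0 else norm_x          # sign choice avoids cancellation
    v = list(x)
    v[0] -= alpha
    return v, 2.0 / sum(t * t for t in v)

def bidiagonalize(a):
    """Golub-Kahan reduction of an m x n matrix (m >= n) to upper bidiagonal
    B = U^T A V by alternating left and right Householder reflections."""
    m, n = len(a), len(a[0])
    b = [[float(x) for x in row] for row in a]        # work on a copy
    for k in range(n):
        # Left reflection on rows k..m-1: zero out b[k+1:, k].
        v, beta = householder([b[i][k] for i in range(k, m)])
        for j in range(k, n):
            s = beta * sum(v[i - k] * b[i][j] for i in range(k, m))
            for i in range(k, m):
                b[i][j] -= s * v[i - k]
        # Right reflection on columns k+1..n-1: zero out b[k, k+2:].
        if k < n - 2:
            w, gamma = householder([b[k][j] for j in range(k + 1, n)])
            for i in range(k, m):
                s = gamma * sum(w[j - k - 1] * b[i][j] for j in range(k + 1, n))
                for j in range(k + 1, n):
                    b[i][j] -= s * w[j - k - 1]
    return b

def frobenius(mat):
    """Frobenius norm; preserved by orthogonal transformations."""
    return math.sqrt(sum(x * x for row in mat for x in row))

A = [[4.0, 1.0, -2.0, 2.0],
     [1.0, 2.0, 0.0, 1.0],
     [-2.0, 0.0, 3.0, -2.0],
     [2.0, 1.0, -2.0, -1.0],
     [1.0, -1.0, 1.0, 3.0]]
B = bidiagonalize(A)
for row in B:
    print(" ".join(f"{x:8.4f}" for x in row))
```

Production libraries such as LAPACK (`dgebrd`) apply the same reflections in blocked form, so that most of the work becomes matrix-matrix products. In an accelerator port, launching a separate kernel for each reflection is one plausible source of the thread-launch and memory-access overhead that the abstract targets.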

Key words: parallel computing, heterogeneous computing, Singular Value Decomposition (SVD), domestic platform, image processing
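In the divide-and-conquer bidiagonal SVD (LAPACK's BDSDC, Fig.2), each merge step reduces, after deflation, to a matrix M whose first row is a vector z and whose remaining diagonal entries are d_2, ..., d_n, with 0 = d_1 < d_2 < ... < d_n. Since MᵀM = D² + zzᵀ with D = diag(d), the squared singular values of M are the roots λ of the secular equation f(λ) = 1 + Σ z_i²/(d_i² - λ) = 0. There is exactly one root in each interval (d_i², d_{i+1}²), and the largest root lies in (d_n², d_n² + ‖z‖²). LASD4 (Fig.3) computes one such root.

The sketch below assumes distinct d_i and nonzero z_i, uses made-up data, and finds each root by plain bisection. That is a deliberate simplification. LAPACK's routine uses a faster rational-approximation iteration on shifted quantities to keep high relative accuracy, and the paper's accelerator version of LASD4 is not described here.

```python
import math

def secular(lam, d, z):
    """f(lam) = 1 + sum_i z_i^2 / (d_i^2 - lam); its roots are the squared singular values."""
    return 1.0 + sum(zi * zi / (di * di - lam) for di, zi in zip(d, z))

def secular_roots(d, z, max_iter=200):
    """Singular values of M = e_1 z^T + diag(d) for 0 = d_1 < ... < d_n and nonzero z_i,
    found root by root with bisection on the interlacing intervals."""
    n = len(d)
    z_norm2 = sum(zi * zi for zi in z)
    sigmas = []
    for i in range(n):
        lo = d[i] ** 2                                   # pole: f -> -inf just right of it
        hi = d[i + 1] ** 2 if i + 1 < n else d[-1] ** 2 + z_norm2
        for _ in range(max_iter):
            mid = 0.5 * (lo + hi)
            if mid <= lo or mid >= hi:                   # interval can no longer shrink
                break
            if secular(mid, d, z) < 0.0:                 # f is increasing on (lo, hi)
                lo = mid
            else:
                hi = mid
        sigmas.append(math.sqrt(0.5 * (lo + hi)))
    return sigmas

d = [0.0, 1.0, 2.5, 4.0]
z = [0.8, 0.5, -0.3, 0.6]
sig = secular_roots(d, z)
print([round(s, 6) for s in sig])
```

Two easy sanity checks follow from MᵀM = D² + zzᵀ. The squared roots sum to trace(D² + zzᵀ). Because M is upper triangular, their product equals |det M| = |z_1| d_2 ⋯ d_n.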

YANG Tailong, ZHAO Hongpeng, ZHANG Lei. Singular Value Decomposition Method Based on Domestic Heterogeneous Platforms[J]. Computer Engineering, 2024, 50(9): 216-225.

Link to this article: https://www.ecice06.com/CN/10.19678/j.issn.1000-3428.0068183

https://www.ecice06.com/CN/Y2024/V50/I9/216

Figures/Tables: 13

Fig.1 Schematic diagram of bidiagonalization

Fig.2 Structure of the BDSDC program

Fig.3 Structure of the LASD4 program

Fig.4 Overall workflow of mySVD

Fig.5 Speedup analysis of LASD4

Fig.6 Image compression and denoising
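The image compression experiment (Fig.6) rests on the Eckart-Young theorem. Keeping only the k largest singular triplets gives the best rank-k approximation in the Frobenius norm, with error √(σ_{k+1}² + ... + σ_n²). Storing that approximation needs k(m + n + 1) numbers instead of mn.

The sketch below demonstrates this on a made-up 6×5 grayscale patch. For brevity it computes the SVD with one-sided (Hestenes) Jacobi rotations. That is a different algorithm from the bidiagonalization plus divide-and-conquer approach of mySVD. The function names and pixel values are hypothetical.

```python
import math

def jacobi_svd(a, tol=1e-12, max_sweeps=60):
    """One-sided (Hestenes) Jacobi SVD of an m x n matrix with m >= n.
    Returns (U, S, V) as lists of left vectors, descending singular values and
    right vectors, so that A = sum_j S[j] * U[j] V[j]^T."""
    m, n = len(a), len(a[0])
    w = [[float(x) for x in row] for row in a]       # invariant: W = A V
    v = [[float(i == j) for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(w[i][p] ** 2 for i in range(m))
                beta = sum(w[i][q] ** 2 for i in range(m))
                gamma = sum(w[i][p] * w[i][q] for i in range(m))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue                          # columns p, q already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                for mat in (w, v):                    # same rotation keeps W = A V
                    for row in mat:
                        rp, rq = row[p], row[q]
                        row[p], row[q] = c * rp - s * rq, s * rp + c * rq
        if not rotated:
            break
    sigma = [math.sqrt(sum(w[i][j] ** 2 for i in range(m))) for j in range(n)]
    order = sorted(range(n), key=lambda j: -sigma[j])
    left = [[w[i][j] / sigma[j] if sigma[j] > 0 else 0.0 for i in range(m)] for j in order]
    right = [[v[i][j] for i in range(n)] for j in order]
    return left, [sigma[j] for j in order], right

def truncate(U, S, V, k):
    """Rank-k approximation A_k = sum_{j<k} S[j] * U[j] V[j]^T."""
    m, n = len(U[0]), len(V[0])
    return [[sum(S[j] * U[j][i] * V[j][l] for j in range(k)) for l in range(n)]
            for i in range(m)]

image = [[52, 55, 61, 66, 70],
         [63, 59, 55, 90, 109],
         [62, 59, 68, 113, 144],
         [63, 58, 71, 122, 154],
         [67, 61, 68, 104, 126],
         [79, 65, 60, 70, 77]]
U, S, V = jacobi_svd(image)
k = 2
approx = truncate(U, S, V, k)
err = math.sqrt(sum((image[i][j] - approx[i][j]) ** 2
                    for i in range(len(image)) for j in range(len(image[0]))))
print(f"rank-{k} error {err:.3f}, tail {math.sqrt(sum(s * s for s in S[k:])):.3f}")
```

The printed error should match the tail of the discarded singular values. Truncated-SVD denoising uses the same operation, on the common assumption that small trailing singular values carry mostly noise.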
