Algorithm Implementation and Optimization of Symmetric Matrix Eigenvalue Solution for FT-M6678

doi:10.19678/j.issn.1000-3428.0067536

Abstract

No implementation of symmetric matrix eigenvalue solving currently exists for China's independently developed and controllable FT-M6678 platform, and the mathematical libraries available on the platform do not adequately meet the needs of such problems. This study implements and optimizes the symmetric matrix eigenvalue (SYEV) algorithm for the domestic FT-M6678 processor, extending the platform's linear algebra library. Based on an analysis of the SYEV implementation process and its runtime hotspots, three classes of optimization are applied on the FT-M6678 platform: compiler optimization, memory access optimization, and vector parallelization. Compiler optimization uses compilation options to guide the compiler in optimizing the program. Memory access optimization comprises cache optimization and optimized placement of data and program segments, which improves the efficiency of matrix data access. Vector parallelization comprises loop unrolling and Single Instruction Multiple Data (SIMD) instruction parallel optimization adapted to the FT-M6678 platform, which improves computational efficiency. The implemented and optimized algorithm is verified for correctness and evaluated for performance on the FT-M6678 platform. The results show that the algorithm passes the official Linear Algebra PACKage (LAPACK) test suite, achieves a speedup of up to 58.346 times on the FT-M6678 platform, and delivers a speed improvement of up to 2.053 times over the TMS320C6678 platform.

Keywords: symmetric matrix eigenvalue, FT-M6678 platform, hotspot analysis, cache optimization, vector parallelism
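The abstract names loop unrolling as one half of the vector parallelization step. The following is a minimal sketch of that idea in portable C, not the authors' code: a dot product is used only as a stand-in kernel (the abstract does not list the actual SYEV hotspots), the function names `dot_ref` and `dot_unrolled4` are hypothetical, and no FT-M6678 SIMD intrinsics are used because the abstract does not specify them. Unrolling by four with independent accumulators removes the loop-carried dependency on a single sum, which is what gives a compiler or a hand-written SIMD version room to issue several multiply-adds in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Reference dot product: one multiply-add per iteration,
 * with every iteration depending on the previous sum. */
double dot_ref(const double *x, const double *y, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i)
        sum += x[i] * y[i];
    return sum;
}

/* 4-way unrolled dot product (illustrative sketch).
 * Four independent accumulators break the dependency chain,
 * so the four multiply-adds in the loop body can overlap. */
double dot_unrolled4(const double *x, const double *y, size_t n)
{
    assert((x != NULL && y != NULL) || n == 0);

    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t limit = n - (n % 4);   /* largest multiple of 4 not exceeding n */
    size_t i = 0;

    for (; i < limit; i += 4) {
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }

    /* Remainder loop for the last n % 4 elements. */
    for (; i < n; ++i)
        s0 += x[i] * y[i];

    return (s0 + s1) + (s2 + s3);
}
```

Because the unrolled version sums in a different order, its result can differ from `dot_ref` in the last bits for general floating-point inputs; any comparison between the two should use a tolerance unless the inputs are exactly representable.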

Li YU, Lin HAN, Youcai LUO, Jiandong SHANG. Algorithm Implementation and Optimization of Symmetric Matrix Eigenvalue Solution for FT-M6678[J]. Computer Engineering, 2024, 50(2): 51-58.

http://www.ecice06.com/CN/Y2024/V50/I2/51

Figures/Tables: 12

Fig.1 FT-M6678 architecture

Fig.2 Implementation process of the SYEV algorithm

Fig.3 Runtime hotspot analysis

Fig.4 FT-M6678 multilevel core memory structure

Fig.5 Schematic diagram of FT-M6678 128-bit parallel multiplication

Fig.6 FT-M6678 correctness test results

Fig.7 Comparison of longitudinal and lateral speedup ratios

