[1]NATH R,TOMOV S,DONGARRA J.An improved MAGMA GEMM for Fermi graphics processing units[J].International Journal of High Performance Computing Applications,2010,24(4):511-515.
[2]TAN G.Fast implementation of DGEMM on Fermi GPU[C]//Proceedings of International Conference on High Performance Computing,Networking,Storage and Analysis.Washington D.C.,USA:IEEE Computer Society,2011:30-35.
[4]JAISWAL M K,CHANDRACHOODAN N.FPGA-based high-performance and scalable block LU decomposition architecture[J].IEEE Transactions on Computers,2011,61(1):60-72.
[5]MICHAILIDIS P D,MARGARITIS K G.Implementing parallel LU factorization with pipelining on a multicore using OpenMP[C]//Proceedings of IEEE International Conference on Computational Science and Engineering.Washington D.C.,USA:IEEE Press,2011:253-260.
[6]VENETIS I E,GAO G R.Mapping the LU decomposition on a many-core architecture:challenges and solutions[C]//Proceedings of ACM Conference on Computing Frontiers.New York,USA:ACM Press,2009:71-80.
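[12]ZHANG K.Research on efficient matrix operation techniques on vector SIMD DSP[D].Changsha:National University of Defense Technology,2013.(in Chinese)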
[15]LIU Z,TIAN X.Vectorization of matrix multiplication for multi-core vector processors[J].Chinese Journal of Computers,2018,41(10):2251-2264.(in Chinese)