[1]NATH R,TOMOV S,DONGARRA J.An improved magma gemm for Fermi graphics processing units[J].International Journal of High Performance Computing Applications,2010,24(4):511-515.
[2]TAN G.Fast implementation of DGEMM on Fermi GPU[C]//Proceedings of International Conference on High Performance Computing,Networking,Storage and Analysis.Washington D.C.,USA:IEEE Computer Society,2011:30-35.
[3]刘刚,张恒,毛睿,等.面向龙芯3B1500体系结构的DGEMM函数优化[J].小型微型计算机系统,2014,35(7):1523-1527.
[4]JAISWAL M K,CHANDRACHOODAN N.FPGA-based high-performance and scalable block LU decomposition architecture[J].IEEE Transactions on Computers,2011,61(1):60-72.
[5]MICHAILIDIS P D,MARGARITIS K G.Implementing parallel LU factorization with pipelining on a multicore using OpenMP[C]//Proceedings of IEEE International Conference on Computational Science and Engineering.Washington D.C.,USA:IEEE Press,2011:253-260.
[6]VENETIS I E,GAO G R.Mapping the LU decomposition on a many-core architecture:challenges and solutions[C]//Proceedings of ACM Conference on Computing Frontiers.New York,USA:ACM Press,2009:71-80.
[7]唐云.基于Spark的大规模分布式矩阵运算算法研究与实现[D].南京:南京大学,2016.
[8]杨飞,马昱春,侯金,等.基于MPSoC并行调度的矩阵乘法加速算法研究[J].计算机科学,2017,44(8):36-41.
[9]龙卓群,王晓瑜,王昌明.基于DCT预测编码的Epiphany-OpenCL大矩阵乘并行计算[J].自动化与仪表,2017,32(7):16-21.
[10]沈俊忠,肖涛,乔寓然,等.一种支持优化分块策略的矩阵乘加速器设计[J].计算机工程与科学,2016,38(9):1748-1754.
[11]魏帅.面向SIMD的向量化算法及重组技术研究[D].郑州:解放军信息工程大学,2012.
[12]张凯.向量SIMD DSP上高效矩阵运算技术研究[D].长沙:国防科技大学,2013.
[13]朱海涛,陈云霁,钱诚,等.基于向量扩展多核处理器的矩阵乘法算法优化研究[J].中国科学技术大学学报,2011,41(2):173-182.
[14]王捷.一种高性能向量处理器的实现[D].天津:天津大学,2016.
[15]刘仲,田希.面向多核向量处理器的矩阵乘法向量化方法[J].计算机学报,2018,41(10):2251-2264. |