
Computer Engineering ›› 2011, Vol. 37 ›› Issue (21): 241-243,254. doi: 10.3969/j.issn.1000-3428.2011.21.082

• Engineering Application Technology and Implementation •

Research on Parallel Architecture for LDLT Decomposition Co-processor

GUO Lei 1, TANG Yu-hua 1, ZHOU Jie 1, DONG Ya-zhuo 2

  (1. National Key Laboratory for Parallel & Distributed Processing, National University of Defense Technology, Changsha 410073, China; 2. PLA 91655 Unit, Beijing 100036, China)
  • Online: 2011-11-05 Published: 2012-05-10
  • About the authors: GUO Lei (born 1987), male, master's student, main research interest: high-performance computer architecture; TANG Yu-hua, research fellow; ZHOU Jie, doctoral student; DONG Ya-zhuo, assistant research fellow, Ph.D.
  • Funding:
    National Natural Science Foundation of China (60921062, 60903057)

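Abstract (translated from Chinese): To improve the performance of an LDLT decomposition co-processor, this paper studies its parallel architecture on an FPGA platform. It analyzes the data dependencies between loop slices, proposes a fine-grained parallel algorithm for LDLT decomposition, and implements it on a scalable one-dimensional array processor. A host and an algorithm accelerator together form the parallel architecture of a single-precision floating-point LDLT decomposition co-processor. Experimental results show that, compared with C code running on a 2.50 GHz Pentium microprocessor, the co-processor achieves a 32.03-fold to 43.25-fold performance improvement.

Keywords: LDLT decomposition, Field Programmable Gate Array (FPGA), fine-grained parallelism, co-processor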

Abstract: This paper studies the parallel architecture and implementation of an LDLT decomposition co-processor for large-scale symmetric matrices, based on a Field Programmable Gate Array (FPGA) platform, in order to improve its performance. It proposes a fine-grained parallel algorithm based on data dependency analysis. A scalable LDLT decomposition array processor is then presented to implement this algorithm. A host and an algorithm accelerator constitute the parallel architecture of a single-precision floating-point LDLT decomposition co-processor. Experimental results show that, compared with a C program running on a 2.50 GHz Pentium CPU, the co-processor achieves a maximum speedup of 43.25 and an average speedup of 32.03.

Key words: LDLT decomposition, Field Programmable Gate Array (FPGA), fine-grained parallelism, co-processor
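
For readers unfamiliar with the factorization named in the abstract, the following is a minimal sequential sketch of the textbook LDL^T decomposition (A = L D L^T, with L unit lower triangular and D diagonal). It is not the paper's fine-grained parallel algorithm or its array-processor mapping, neither of which is described on this page. The function names `ldlt` and `reconstruct` are illustrative assumptions, and the sketch assumes a symmetric input with nonzero pivots (no pivoting is performed). The comments only point out where the loop structure carries dependencies and where it does not, which is the kind of property a data dependency analysis would examine.

```python
def ldlt(a):
    """Textbook LDL^T factorization of a symmetric matrix (list of lists).

    Returns (L, d) with L unit lower triangular and d the diagonal of D.
    Illustrative sketch only: no pivoting, so a zero pivot raises an error.
    """
    n = len(a)
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    d = [0.0] * n
    for j in range(n):
        # Column j depends on all earlier columns k < j (loop-carried dependency).
        v = [L[j][k] * d[k] for k in range(j)]
        d[j] = a[j][j] - sum(L[j][k] * v[k] for k in range(j))
        if d[j] == 0.0:
            raise ZeroDivisionError("zero pivot: factorization needs pivoting")
        # For a fixed j, the updates of different rows i do not depend on
        # each other, so in principle they could be computed concurrently.
        for i in range(j + 1, n):
            L[i][j] = (a[i][j] - sum(L[i][k] * v[k] for k in range(j))) / d[j]
    return L, d


def reconstruct(L, d):
    """Multiply L * diag(d) * L^T back together to check a factorization."""
    n = len(d)
    return [[sum(L[i][k] * d[k] * L[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    A = [[4.0, 12.0, -16.0], [12.0, 37.0, -43.0], [-16.0, -43.0, 98.0]]
    L, d = ldlt(A)
    print(d)                        # diagonal of D
    print(reconstruct(L, d) == A)   # exact here because all values are small integers
```

The sketch uses Python floats (double precision) rather than the single-precision arithmetic mentioned in the abstract, so it illustrates the recurrence only, not the numerical behavior of the hardware.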
