Computer Engineering ›› 2019, Vol. 45 ›› Issue (10): 40-45. doi: 10.19678/j.issn.1000-3428.0052372

• Architecture and Software Technology •

  • About the authors: RAN Decheng (1989-), male, M.S. candidate; his main research interest is reconfigurable accelerated computing. WU Dong, research fellow, Ph.D.; QIAN Lei, engineer, M.S.
  • Supported by the National Natural Science Foundation of China (61732010).

Design of Matrix Multiplication Accelerator for Deep Learning Inference

RAN Decheng, WU Dong, QIAN Lei   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Received: 2018-08-10  Revised: 2018-09-16  Online: 2019-10-15  Published: 2018-10-25


Abstract: To satisfy the computing requirements of matrix multiplications of different sizes in deep learning inference, an integer matrix multiplication accelerator based on the Zynq SoC platform is proposed. Its parallel architecture, built on bus broadcasting, makes full use of on-chip data reuse and minimizes the movement of intermediate accumulation results, thereby reducing accesses to external DRAM. By dynamically adjusting the matrix tile sizes, the accelerator maintains high efficiency when computing irregularly shaped matrix multiplications. Experimental results on the DeepBench benchmark show that the accelerator achieves an 8.4x speedup over matrix multiplication on a dual-core ARM Cortex-A9 CPU.
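The tiling idea described in the abstract can be illustrated in software: a blocked integer matrix multiply whose tile sizes adapt to the matrix shape, so that tall/thin or short/wide ("irregular") products still fill their tiles with useful work, and each loaded element of A is broadcast across a whole row tile of B. This is only a minimal CPU sketch of the general technique, not the paper's accelerator design; the function names (`pick_tile`, `blocked_matmul`) and the preferred tile size of 4 are illustrative assumptions.

```python
# Illustrative sketch of shape-adaptive blocked integer matrix multiplication.
# Not the paper's hardware design; tile-size choice and names are assumptions.

def pick_tile(dim, preferred=4):
    # Shrink the tile along any dimension smaller than the preferred tile
    # size, so no tile is padded with wasted work on irregular shapes.
    return min(dim, preferred)

def blocked_matmul(A, B):
    # A is M x K, B is K x N, integer entries; returns C = A * B.
    M, K, N = len(A), len(A[0]), len(B[0])
    tm, tk, tn = pick_tile(M), pick_tile(K), pick_tile(N)
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for k0 in range(0, K, tk):              # reuse the A tile across all j0
            for j0 in range(0, N, tn):
                for i in range(i0, min(i0 + tm, M)):
                    for k in range(k0, min(k0 + tk, K)):
                        a = A[i][k]             # one A element, broadcast to a row tile of B
                        for j in range(j0, min(j0 + tn, N)):
                            C[i][j] += a * B[k][j]  # accumulate partial sums in place
    return C
```

Keeping the partial sums of `C` in place across the `k0` loop mirrors the abstract's goal of minimizing the movement of intermediate accumulation results.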

Key words: integer matrix multiplication, accelerator, programmable System on Chip (SoC), deep learning inference, blocking scheme, DeepBench test
