Computer Engineering ›› 2019, Vol. 45 ›› Issue (10): 40-45. doi: 10.19678/j.issn.1000-3428.0052372

• Architecture and Software Technology •

  • About the authors: RAN Decheng (1989-), male, M.S. candidate; his main research interest is reconfigurable accelerated computing. WU Dong, research fellow, Ph.D.; QIAN Lei, engineer, M.S.
  • Supported by the National Natural Science Foundation of China (61732010).

Design of Matrix Multiplication Accelerator for Deep Learning Inference

RAN Decheng, WU Dong, QIAN Lei   

  1. State Key Laboratory of Mathematical Engineering and Advanced Computing, Wuxi, Jiangsu 214125, China
  • Received: 2018-08-10  Revised: 2018-09-16  Online: 2019-10-15  Published: 2018-10-25


Abstract: To satisfy the computing requirements of matrix multiplications of different sizes in deep learning inference, an integer matrix multiplication accelerator based on the Zynq SoC platform is proposed. Its parallel architecture, built on bus broadcasting, makes full use of on-chip data reuse and minimizes the movement of intermediate accumulation results, thereby reducing accesses to external DRAM. By dynamically adjusting the matrix tile sizes, the accelerator maintains high efficiency when computing irregularly shaped matrix multiplications. Experimental results on the DeepBench benchmark show that the accelerator achieves an 8.4x speedup over matrix multiplication on a dual-core ARM Cortex-A9 CPU.
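The tiling idea described in the abstract can be illustrated in software: a blocked integer matrix multiply whose tile sizes adapt to the matrix shape, so that tall/thin or short/wide ("irregular") products still fill their tiles with useful work, and each loaded element of A is broadcast across a whole row tile of B. This is only a minimal CPU sketch of the general technique, not the paper's accelerator design; the function names (`pick_tile`, `blocked_matmul`) and the preferred tile size of 4 are illustrative assumptions.

```python
# Illustrative sketch of shape-adaptive blocked integer matrix multiplication.
# Not the paper's hardware design; tile-size choice and names are assumptions.

def pick_tile(dim, preferred=4):
    # Shrink the tile along any dimension smaller than the preferred tile
    # size, so no tile is padded with wasted work on irregular shapes.
    return min(dim, preferred)

def blocked_matmul(A, B):
    # A is M x K, B is K x N, integer entries; returns C = A * B.
    M, K, N = len(A), len(A[0]), len(B[0])
    tm, tk, tn = pick_tile(M), pick_tile(K), pick_tile(N)
    C = [[0] * N for _ in range(M)]
    for i0 in range(0, M, tm):
        for k0 in range(0, K, tk):              # reuse the A tile across all j0
            for j0 in range(0, N, tn):
                for i in range(i0, min(i0 + tm, M)):
                    for k in range(k0, min(k0 + tk, K)):
                        a = A[i][k]             # one A element, broadcast to a row tile of B
                        for j in range(j0, min(j0 + tn, N)):
                            C[i][j] += a * B[k][j]  # accumulate partial sums in place
    return C
```

Keeping the partial sums of `C` in place across the `k0` loop mirrors the abstract's goal of minimizing the movement of intermediate accumulation results.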

Key words: integer matrix multiplication, accelerator, programmable System on Chip (SoC), deep learning inference, blocking scheme, DeepBench test
