Computer Engineering ›› 2019, Vol. 45 ›› Issue (12): 153-159. doi: 10.19678/j.issn.1000-3428.0053855

• Artificial Intelligence and Recognition Technology •

  • About the authors: SHU Jiaming (1995-), male, M.S. candidate; his main research interests are deep learning, computer architecture, and high-performance computing. AN Hong, professor, Ph.D.; WU Zheng and CHEN Junshi, Ph.D. candidates.
  • Funding:
    National Key Research and Development Program of China (2016YFB1000403).

A General Parallel Convolution Algorithm for Sunway Taihu Light

SHU Jiaming, AN Hong, WU Zheng, CHEN Junshi   

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230000, China
  • Received: 2019-01-30  Revised: 2019-04-04  Published: 2019-05-22


Abstract: The parallel convolution algorithm in the deep learning library of Sunway Taihu Light suffers from a batch-size limitation, and the traditional GEMM-based convolution algorithm is inefficient on its hardware architecture. To address these problems, a general parallel convolution algorithm without batch limitation is proposed for the Sunway heterogeneous many-core processor. By combining asynchronous DMA memory access with inter-core register communication, the algorithm reduces the memory access overhead of the slave cores through data reuse and software pipelining, and fully exploits the floating-point computing power of the slave cores through manual vectorization. Experimental results show that the proposed algorithm achieves better acceleration performance than the basic 7-layer loop algorithm, the GEMM algorithm, and the MKL-DNN algorithm on the Intel platform.

Key words: Sunway Taihu Light, Convolutional Neural Network (CNN), data reuse, software pipelining, batch limitation
