
计算机工程 ›› 2023, Vol. 49 ›› Issue (4): 138-148. doi: 10.19678/j.issn.1000-3428.0064142

• 先进计算与数据处理 •

OclDNN:一种可应用于TensorFlow的通用DNN库

陈锐1, 孙羽菲1,2, 郭强1, 隋轶丞1, 周振辉1, 石昌青1, 张玉志1,2   

  1. 南开大学 软件学院, 天津 300457;
    2. 先进计算与关键软件海河实验室, 天津 300459
  • 收稿日期:2022-03-09 修回日期:2022-04-14 发布日期:2022-07-25
  • About the authors: CHEN Rui (born 1991), male, Ph.D. candidate, whose main research interests are deep learning and heterogeneous computing; SUN Yufei, distinguished researcher; GUO Qiang, M.S. candidate; SUI Yicheng, Ph.D. candidate; ZHOU Zhenhui and SHI Changqing, M.S. candidates; ZHANG Yuzhi, professor.
  • Funding:
    This work was supported by the National Key Research and Development Program of China (2021YFB0300104).

OclDNN: A General-Purpose DNN Library for TensorFlow

CHEN Rui1, SUN Yufei1,2, GUO Qiang1, SUI Yicheng1, ZHOU Zhenhui1, SHI Changqing1, ZHANG Yuzhi1,2   

  1. College of Software, Nankai University, Tianjin 300457, China;
    2. Haihe Lab of ITAI, Tianjin 300459, China
  • Received:2022-03-09 Revised:2022-04-14 Published:2022-07-25

摘要: 深度学习模型的构建、训练以及推理离不开TensorFlow等机器学习框架中深度学习算子的支撑,对于卷积、池化等深度学习中被高频调用或计算量较大的算子,机器学习框架一般通过调用深度神经网络(DNN)库来提升计算效能。现有DNN库主要由英伟达、AMD等少数国外厂商开发并根据自有硬件设备特点进行优化,但其封闭性导致其他厂商生产的通用加速器难以在深度学习领域发挥作用。为解决现有DNN库无法支持国产加速器的问题,使得深度学习模型能够调用国产加速器进行运算,研究跨平台的通用DNN库,通过对开源MIOpen的结构特点和调用方式进行分析,提出修改和重构该库的方法,并实现一种基于OpenCL的DNN(OclDNN)库。考虑到TensorFlow较高的流行度及其对DNN库调用的特殊性与复杂性,研究通用DNN库在TensorFlow中的集成方法,通过StreamExecutor中的OpenCL平台实现对OclDNN的调用。实验结果表明,OclDNN在英伟达、华为等不同厂商的计算设备上运算结果正确可靠,在相同实验环境下,深度学习算子使用OclDNN时的加速性能比传统CPU并行算法提升了5~60倍。
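For illustration, the sketch below shows the basic pattern the abstract describes: a deep learning operator (here a ReLU) written against the standard OpenCL host API, so that the same host code can run on accelerators from different vendors. This is a minimal sketch assuming an installed OpenCL runtime; the kernel and all host-side names are hypothetical examples and are not the actual OclDNN interface.

// Minimal sketch, assuming an OpenCL runtime is installed. The relu kernel
// and this standalone program are illustrative only, not the OclDNN API.
#include <CL/cl.h>
#include <cstdio>
#include <vector>

static const char* kReluSrc = R"CLC(
__kernel void relu(__global float* x, const int n) {
    int i = get_global_id(0);
    if (i < n) x[i] = x[i] > 0.0f ? x[i] : 0.0f;
}
)CLC";

int main() {
    // Pick the first available platform and device; a real library would
    // enumerate devices and select one according to the caller's request.
    cl_platform_id platform;
    cl_device_id device;
    clGetPlatformIDs(1, &platform, nullptr);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, nullptr);

    cl_int err;
    cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, &err);

    // Compile the kernel from source at run time, which is what makes the
    // same binary portable across Nvidia, AMD, Huawei and other devices.
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kReluSrc, nullptr, &err);
    clBuildProgram(prog, 1, &device, "", nullptr, nullptr);
    cl_kernel kernel = clCreateKernel(prog, "relu", &err);

    std::vector<float> x = {-2.0f, -1.0f, 0.5f, 3.0f};
    const cl_int n = static_cast<cl_int>(x.size());
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), x.data(), &err);

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
    clSetKernelArg(kernel, 1, sizeof(cl_int), &n);
    size_t global = static_cast<size_t>(n);
    clEnqueueNDRangeKernel(queue, kernel, 1, nullptr, &global, nullptr, 0, nullptr, nullptr);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float), x.data(), 0, nullptr, nullptr);

    for (float v : x) std::printf("%.1f ", v);  // expected output: 0.0 0.0 0.5 3.0
    std::printf("\n");

    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
    return 0;
}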

关键词: 深度神经网络库, 深度学习, 开放计算语言, 硬件加速器, TensorFlow框架

Abstract: In machine learning frameworks such as TensorFlow, the construction, training, and inference of deep learning models rely on the support of deep learning operators. For operators that are frequently invoked or computationally heavy, such as convolution and pooling, machine learning frameworks generally improve computational efficiency by calling Deep Neural Network (DNN) libraries. Existing DNN libraries are mainly developed by a few foreign manufacturers such as Nvidia and AMD and are optimized for the characteristics of their own hardware; their closed nature makes it difficult for general-purpose accelerators from other manufacturers to play a role in the deep learning field. To solve the problem that existing DNN libraries cannot support domestic accelerators, and to enable deep learning models to perform computations on them, a cross-platform general-purpose DNN library was studied. By analyzing the structural characteristics and calling conventions of the open-source library MIOpen, a method for modifying and restructuring it was developed, and OclDNN, a DNN library based on OpenCL, was implemented. Considering the high popularity of TensorFlow and the particularity and complexity of its DNN library calls, the integration of a general-purpose DNN library into TensorFlow was also studied, and calls to OclDNN were implemented through the OpenCL platform in StreamExecutor. The experimental results indicate that OclDNN produces correct and reliable results on computing devices from Nvidia, Huawei, and other manufacturers, and that, in the same experimental environment, deep learning operators using OclDNN achieve 5 to 60 times the acceleration of traditional CPU parallel algorithms.
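As a rough conceptual sketch of the integration idea mentioned above (the framework routing operator calls to a pluggable DNN library selected per platform), the code below uses an invented DnnBackend interface and registry. It is not TensorFlow's StreamExecutor API and not the paper's actual integration code; it only illustrates the dispatch pattern.

// Conceptual sketch only: the backend interface, registry, and both
// implementations are invented for illustration.
#include <functional>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

// Minimal interface a DNN library would expose to the framework.
struct DnnBackend {
    virtual ~DnnBackend() = default;
    virtual void Relu(std::vector<float>& x) = 0;
};

// CPU reference implementation (the kind of baseline the paper compares against).
struct CpuBackend : DnnBackend {
    void Relu(std::vector<float>& x) override {
        for (float& v : x) v = v > 0.0f ? v : 0.0f;
    }
};

// Stand-in for an OpenCL-backed library such as OclDNN; it delegates to the
// CPU version here so the sketch stays self-contained and runnable.
struct OpenClBackend : DnnBackend {
    void Relu(std::vector<float>& x) override { CpuBackend{}.Relu(x); }
};

// Registry keyed by platform name, so the framework can choose a backend at
// run time ("host" for CPU, "opencl" for an accelerator).
std::map<std::string, std::function<std::unique_ptr<DnnBackend>()>>& Registry() {
    static std::map<std::string, std::function<std::unique_ptr<DnnBackend>()>> r;
    return r;
}

int main() {
    Registry()["host"]   = [] { return std::unique_ptr<DnnBackend>(new CpuBackend()); };
    Registry()["opencl"] = [] { return std::unique_ptr<DnnBackend>(new OpenClBackend()); };

    std::vector<float> x = {-1.5f, 2.0f, -0.3f};
    auto backend = Registry()["opencl"]();  // the framework picks the platform
    backend->Relu(x);
    for (float v : x) std::cout << v << ' ';  // prints: 0 2 0
    std::cout << '\n';
    return 0;
}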

Key words: Deep Neural Network(DNN) library, deep learning, Open Computing Language(OpenCL), hardware accelerator, TensorFlow framework
