Optimization of 3D Convolutional Forward Operators Based on Domestic Accelerators

doi:10.19678/j.issn.1000-3428.0068480

Abstract

Three-dimensional (3D) Convolutional Neural Networks (3D CNNs) are used in an increasingly wide range of applications. A 3D CNN can extract richer and more discriminative features from raw data, which is important for processing 3D data, for feature extraction, and for practical applications. However, the shift from two-dimensional (2D) to 3D data exponentially increases both the amount of data and the computation required for convolution, and with them the demand for computational resources and time. Training and inference therefore become more time-consuming, particularly on large-scale 3D data. To address these problems, this study proposes an implicit convolution algorithm for a domestic accelerator that optimizes the forward computation of 3D convolution. First, the algorithm combines hardware characteristics with a parallelization strategy: it uses an index to access the address of each required data element directly, without allocating additional memory, which considerably reduces memory overhead. Second, the domestic accelerator has a highly parallel computing structure and abundant computing resources, which makes it well suited to large-scale data and complex computing tasks. Finally, a series of heterogeneous parallel optimization algorithms, tailored to the computing power and architecture of the domestic accelerator, speeds up the 3D convolutional forward operator and improves computational efficiency and performance. The experimental results indicate that the self-developed operators significantly outperform the best existing operators on the domestic computing platform, and that in most cases their energy efficiency ratio relative to the NVIDIA V100 reaches 70% or higher.

Key words: 3D convolution, domestic accelerator, implicit convolution algorithm, indexing mechanism, forward operator optimization, parallel optimization algorithm
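
The index-based access described in the abstract can be illustrated with a minimal sketch. It is not the authors' accelerator kernel: the function names, the flat (C, D, H, W) input layout, the (K, C, KD, KH, KW) weight layout, and the restriction to stride 1 without padding are all assumptions made for illustration. The implicit version computes, for each GEMM position, the input address that an im2col matrix would hold there, so no im2col buffer is allocated; the explicit version materializes that matrix for comparison.

```python
def conv3d_implicit_gemm(x, w, in_shape, w_shape):
    """3D convolution forward pass (stride 1, no padding) as an implicit GEMM.

    x is a flat list in (C, D, H, W) order; w is a flat list in
    (K, C, KD, KH, KW) order. Both layouts are assumptions of this sketch.
    """
    C, D, H, W = in_shape
    K, Cw, KD, KH, KW = w_shape
    assert C == Cw, "channel mismatch"
    Do, Ho, Wo = D - KD + 1, H - KH + 1, W - KW + 1
    rows = C * KD * KH * KW   # GEMM reduction dimension
    cols = Do * Ho * Wo       # one GEMM column per output voxel
    out = [0.0] * (K * cols)
    for col in range(cols):
        od, rem = divmod(col, Ho * Wo)
        oh, ow = divmod(rem, Wo)
        for k in range(K):
            acc = 0.0
            for r in range(rows):
                c, rem_r = divmod(r, KD * KH * KW)
                kd, rem_r = divmod(rem_r, KH * KW)
                kh, kw = divmod(rem_r, KW)
                # Address of the element im2col would store at (r, col),
                # computed on the fly instead of being copied to a buffer.
                src = ((c * D + od + kd) * H + oh + kh) * W + ow + kw
                acc += w[k * rows + r] * x[src]
            out[k * cols + col] = acc
    return out, (K, Do, Ho, Wo)


def conv3d_im2col(x, w, in_shape, w_shape):
    """Reference: explicit im2col (extra rows x cols buffer) followed by GEMM."""
    C, D, H, W = in_shape
    K, _, KD, KH, KW = w_shape
    Do, Ho, Wo = D - KD + 1, H - KH + 1, W - KW + 1
    rows, cols = C * KD * KH * KW, Do * Ho * Wo
    col_mat = [[0.0] * cols for _ in range(rows)]  # the buffer implicit GEMM avoids
    for r in range(rows):
        c, rem_r = divmod(r, KD * KH * KW)
        kd, rem_r = divmod(rem_r, KH * KW)
        kh, kw = divmod(rem_r, KW)
        for col in range(cols):
            od, rem = divmod(col, Ho * Wo)
            oh, ow = divmod(rem, Wo)
            col_mat[r][col] = x[((c * D + od + kd) * H + oh + kh) * W + ow + kw]
    out = [sum(w[k * rows + r] * col_mat[r][col] for r in range(rows))
           for k in range(K) for col in range(cols)]
    return out, (K, Do, Ho, Wo)


# Small check with a 1x3x3 kernel: both paths should agree.
in_shape, w_shape = (2, 3, 4, 4), (3, 2, 1, 3, 3)
x = [float(i % 7) for i in range(2 * 3 * 4 * 4)]
w = [float(i % 5 - 2) for i in range(3 * 2 * 1 * 3 * 3)]
assert conv3d_implicit_gemm(x, w, in_shape, w_shape) == conv3d_im2col(x, w, in_shape, w_shape)
```

In this sketch the explicit path needs a buffer of `rows * cols` elements on top of the input, while the implicit path needs only index arithmetic. On real hardware that arithmetic would be fused into tiled, parallel GEMM loops rather than run as scalar Python.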

JI Chenchen, CHEN Yongqing, HAN Mengzhi. Optimization of 3D Convolutional Forward Operators Based on Domestic Accelerators[J]. Computer Engineering, 2025, 51(2): 250-258.

URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0068480

https://www.ecice06.com/EN/Y2025/V51/I2/250

Figures/Tables 13

Fig.1 Single channel 3D forward convolution calculation process

Fig.2 Im2col transformation process

Fig.3 Comparison of time consumption between domestic operators and V100 operators

Fig.4 Single batch and single channel 3D convolutional GEMM conversion

Fig.5 Schematic diagram of traditional matrix multiplication

Fig.6 Schematic diagram of block matrix multiplication

Fig.7 Multi-level storage access process

Fig.8 Comparison of 1×1×1 convolutional kernel optimization effects

Fig.9 Comparison of 1×3×3 convolutional kernel optimization effects

Fig.10 Comparison of 3×1×1 convolutional kernel optimization effects

Fig.11 Energy efficiency ratio between self-developed operator and V100 operator
