
Computer Engineering ›› 2025, Vol. 51 ›› Issue (9): 71-79. doi: 10.19678/j.issn.1000-3428.0069598

• Artificial Intelligence and Pattern Recognition •

Winograd Heterogeneous Sampling Window Convolution Acceleration Operator

PENG Yun1,2, WANG Yubing1,*(), LIANG Lei1, SONG Yue1, QIU Cheng1, LEI Yuxin1, JIA Peng1, MIAO Guoqing1, QIN Li1, WANG Lijun1

  1. State Key Laboratory of Luminescence Science and Technology, Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, Jilin, China
    2. Materials and Optoelectronics Research Center, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-03-18 Revised: 2024-05-16 Online: 2025-09-15 Published: 2024-08-23
  • Contact: WANG Yubing
  • Supported by:
    Science and Technology Development Program of Jilin Province (20230201033GX); Changchun Outstanding Young Scientific and Technological Talents Project (23YQ18); National Natural Science Foundation of China (62090054); National Key Research and Development Program of China (2022YFB2803500); International Cooperation Project of Jilin Province (20230502005GH); Academy-Locality Cooperation Project of the Chinese Academy of Engineering (JL2023-16)


Abstract:

In recent years, Artificial Intelligence (AI) has been widely applied in fields such as large models, autonomous driving, and robotics. As the core of AI, neural networks can process big data, learn and adapt to complex patterns, and perform a wide variety of tasks. Neural networks typically use convolution operations to extract local features from the input data, helping them learn and understand the structure and patterns of data such as images and sounds. However, a single convolution pass involves dense multiply-accumulate operations that account for the vast majority of the convolution time, making convolution the main bottleneck for real-time deployment of neural networks. To accelerate convolution at the hardware level, this paper proposes a Winograd convolution acceleration operator with a heterogeneous sampling window: a heterogeneous 4×2 sampling window improves data utilization, the Winograd hardware acceleration module is designed with pipelining and fixed-point arithmetic, and a ReLU module based on pooling fusion is proposed. Prototype verification experiments on a Field Programmable Gate Array (FPGA) show that the proposed operator achieves a speedup of 86.4 over a single-channel original sliding-window convolution and 28.8 over a three-channel sliding-window convolution, and reduces the volume of data read and written to 11.07% of the original. Its resource consumption is lower than that of Winograd convolution acceleration operators of the same type, it holds a clear advantage over the Fast Fourier Transform (FFT), and it is suitable for large-scale integration and for building convolutional neural networks.
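The saving the abstract refers to comes from Winograd's minimal filtering algorithm, which replaces some of the multiplications in a sliding-window convolution with cheap additions. Below is a minimal software sketch of the 1D F(2,3) case, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. This is illustrative only: the paper's operator is a 2D, pipelined, fixed-point FPGA design with a heterogeneous 4×2 sampling window, not this software form.

```python
def winograd_f23(d, g):
    """1D Winograd F(2,3): 4 input samples d, 3 filter taps g ->
    the two 'valid' correlation outputs, using only 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Transformed products (the only multiplications needed):
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform recombines the products with additions only:
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference sliding-window correlation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 0, -1]))  # -> [-2.0, -2.0]
print(direct([1, 2, 3, 4], [1, 0, -1]))        # -> [-2, -2]
```

The filter-dependent factors `(g0 + g1 + g2) / 2` and `(g0 - g1 + g2) / 2` can be precomputed once per filter, which is what makes the trade worthwhile in a hardware pipeline where adders are far cheaper than multipliers.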

Key words: Winograd, convolution acceleration operator, hardware acceleration, heterogeneous sampling, Field Programmable Gate Array (FPGA)
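The "ReLU module based on pooling fusion" mentioned in the abstract can be understood through a standard identity: because ReLU is monotonically non-decreasing, max-pooling followed by a single ReLU equals per-element ReLU followed by max-pooling, so one clamp per pooling window suffices. The sketch below shows that fusion idea in software under this assumption; the paper's module is implemented in FPGA logic, and the function name is hypothetical.

```python
def relu_maxpool_fused(x):
    """Fused ReLU + 2x2 max pooling over a 2D list x (even H and W).
    One max over each window, then a single clamp at zero per window,
    instead of clamping every element before pooling."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            window_max = max(x[i][j], x[i][j + 1],
                             x[i + 1][j], x[i + 1][j + 1])
            row.append(max(window_max, 0))  # single ReLU per window
        out.append(row)
    return out

print(relu_maxpool_fused([[-3, 1], [2, -5]]))  # -> [[2]]
```

In hardware, this fusion removes a per-element activation stage: the comparator tree that computes the window maximum is followed by one clamp, cutting both logic and the intermediate buffering between the ReLU and pooling stages.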
