
Computer Engineering ›› 2025, Vol. 51 ›› Issue (9): 71-79. doi: 10.19678/j.issn.1000-3428.0069598

• Artificial Intelligence and Pattern Recognition •

Winograd Heterogeneous Sampling Window Convolution Acceleration Operator

PENG Yun1,2, WANG Yubing1,*(), LIANG Lei1, SONG Yue1, QIU Cheng1, LEI Yuxin1, JIA Peng1, MIAO Guoqing1, QIN Li1, WANG Lijun1

  1. State Key Laboratory of Luminescence Science and Technology, Changchun Institute of Optics, Fine Mechanics and Physics, Chinese Academy of Sciences, Changchun 130033, Jilin, China
    2. Materials and Optoelectronics Research Center, University of Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2024-03-18 Revised: 2024-05-16 Online: 2025-09-15 Published: 2024-08-23
  • Contact: WANG Yubing
  • Supported by:
    Science and Technology Development Program of Jilin Province (20230201033GX); Changchun Outstanding Young Scientific and Technological Talents Project (23YQ18); National Natural Science Foundation of China (62090054); National Key Research and Development Program of China (2022YFB2803500); International Cooperation Project of Jilin Province (20230502005GH); Academy-Locality Cooperation Project of the Chinese Academy of Engineering (JL2023-16)


Abstract:

In recent years, Artificial Intelligence (AI) has been widely applied in fields such as large models, autonomous driving, and robotics. As the core of AI, neural networks can process big data, learn and adapt to complex patterns, and perform a wide variety of tasks. Neural networks typically use convolution operations to extract local features from the input data, helping them learn and understand the structure and patterns of data such as images and sounds. However, a single convolution pass involves dense multiply-accumulate operations that account for the vast majority of the convolution time, making convolution the main bottleneck for real-time deployment of neural networks. To accelerate convolution at the hardware level, this paper proposes a Winograd convolution acceleration operator with a heterogeneous sampling window: a heterogeneous 4×2 sampling window improves data utilization, the Winograd hardware acceleration module is designed with pipelining and fixed-point arithmetic, and a ReLU module based on pooling fusion is proposed. Prototype verification experiments on a Field Programmable Gate Array (FPGA) show that the proposed operator achieves a speedup of 86.4 over a single-channel original sliding-window convolution and 28.8 over a three-channel sliding-window convolution, and reduces the volume of data read and written to 11.07% of the original. Its resource consumption is lower than that of Winograd convolution acceleration operators of the same type, it holds a clear advantage over the Fast Fourier Transform (FFT), and it is suitable for large-scale integration and for building convolutional neural networks.
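The saving the abstract refers to comes from Winograd's minimal filtering algorithm, which replaces some of the multiplications in a sliding-window convolution with cheap additions. Below is a minimal software sketch of the 1D F(2,3) case, which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. This is illustrative only: the paper's operator is a 2D, pipelined, fixed-point FPGA design with a heterogeneous 4×2 sampling window, not this software form.

```python
def winograd_f23(d, g):
    """1D Winograd F(2,3): 4 input samples d, 3 filter taps g ->
    the two 'valid' correlation outputs, using only 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Transformed products (the only multiplications needed):
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    # Output transform recombines the products with additions only:
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    """Reference sliding-window correlation (6 multiplications)."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]

print(winograd_f23([1, 2, 3, 4], [1, 0, -1]))  # -> [-2.0, -2.0]
print(direct([1, 2, 3, 4], [1, 0, -1]))        # -> [-2, -2]
```

The filter-dependent factors `(g0 + g1 + g2) / 2` and `(g0 - g1 + g2) / 2` can be precomputed once per filter, which is what makes the trade worthwhile in a hardware pipeline where adders are far cheaper than multipliers.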

Key words: Winograd, convolution acceleration operator, hardware acceleration, heterogeneous sampling, Field Programmable Gate Array (FPGA)
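The "ReLU module based on pooling fusion" mentioned in the abstract can be understood through a standard identity: because ReLU is monotonically non-decreasing, max-pooling followed by a single ReLU equals per-element ReLU followed by max-pooling, so one clamp per pooling window suffices. The sketch below shows that fusion idea in software under this assumption; the paper's module is implemented in FPGA logic, and the function name is hypothetical.

```python
def relu_maxpool_fused(x):
    """Fused ReLU + 2x2 max pooling over a 2D list x (even H and W).
    One max over each window, then a single clamp at zero per window,
    instead of clamping every element before pooling."""
    h, w = len(x), len(x[0])
    out = []
    for i in range(0, h, 2):
        row = []
        for j in range(0, w, 2):
            window_max = max(x[i][j], x[i][j + 1],
                             x[i + 1][j], x[i + 1][j + 1])
            row.append(max(window_max, 0))  # single ReLU per window
        out.append(row)
    return out

print(relu_maxpool_fused([[-3, 1], [2, -5]]))  # -> [[2]]
```

In hardware, this fusion removes a per-element activation stage: the comparator tree that computes the window maximum is followed by one clamp, cutting both logic and the intermediate buffering between the ReLU and pooling stages.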
