[1] ALI S B, FILIP S I, SENTIEYS O. A stochastic rounding-enabled low-precision floating-point MAC for DNN training[C]//2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024: 1-6.
[2] WONG Y, DONG Z, ZHANG W. Low bitwidth CNN accelerator on FPGA using Winograd and block floating point arithmetic[C]//2021 IEEE Computer Society Annual Symposium on VLSI (ISVLSI). IEEE, 2021: 218-223.
[3] NI C, LU J, LIN J, et al. LBFP: Logarithmic block floating point arithmetic for deep neural networks[C]//2020 IEEE Asia Pacific Conference on Circuits and Systems (APCCAS). IEEE, 2020: 201-204.
[4] PENG Y, WANG Y B, LIANG L, et al. Winograd heterogeneous sampling window convolution acceleration operator[J]. Computer Engineering, 2025, 51(9): 71-79.
[5] ZOU L, ZHAO W, YIN S, et al. BiE: bi-exponent block floating-point for large language models quantization[C]//Forty-first International Conference on Machine Learning. 2024.
[6] HAN X, CHENG Y, WANG J, et al. BBAL: A bidirectional block floating point-based quantisation accelerator for large language models[C]//2025 62nd ACM/IEEE Design Automation Conference (DAC). IEEE, 2025: 1-7.
[7] SONG Z, LIU Z, WANG D. Computation error analysis of block floating point arithmetic oriented convolution neural network accelerator design[C]//Proceedings of the AAAI Conference on Artificial Intelligence. 2018, 32(1).
[8] LEE S, CHOI J, NOH S, et al. DBPS: dynamic block size and precision scaling for efficient DNN training supported by RISC-V ISA extensions[C]//2023 60th ACM/IEEE Design Automation Conference (DAC). IEEE, 2023: 1-6.
[9] NASCIMENTO M G, PRISACARIU V A, FAWCETT R, et al. Hyperblock floating point: Generalised quantization scheme for gradient and inference computation[C]//Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision. 2023: 6364-6373.
[10] LO Y C, LIU R S. Bucket getter: A bucket-based processing engine for low-bit block floating point (BFP) DNNs[C]//Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture. 2023: 1002-1015.
[11] LU L, LIANG Y, XIAO Q, et al. Evaluating fast algorithms for convolutional neural networks on FPGAs[C]//2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2017: 101-108.
[12] AHMAD A, PASHA M A. FFConv: an FPGA-based accelerator for fast convolution layers in convolutional neural networks[J]. ACM Transactions on Embedded Computing Systems (TECS), 2020, 19(2): 1-24.
[13] ELEFTHERIADIS C, KARAKONSTANTIS G. Energy-efficient fast Fourier transform for real-valued applications[J]. IEEE Transactions on Circuits and Systems II: Express Briefs, 2022, 69(5): 2458-2462.
[14] ZHANG F, GAO Z, HUANG J, et al. HFOD: A hardware-friendly quantization method for object detection on embedded FPGAs[J]. IEICE Electronics Express, 2022, 19(8): 20220067.
[15] FRASSER C F, LINARES-SERRANO P, DE LOS RÍOS I D, et al. Fully parallel stochastic computing hardware implementation of convolutional neural networks for edge computing applications[J]. IEEE Transactions on Neural Networks and Learning Systems, 2022, 34(12): 10408-10418.
[16] GUAN M X, LIU J K, ZHANG H R, et al. Study of FPGA-based error-controllable floating-point operation accelerators[J]. Computer Engineering, 2024, 50(5): 291-297.
[17] LEE J, LEE W, SIM J. Tender: Accelerating large language models via tensor decomposition and runtime requantization[C]//2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 2024: 1048-1062.
[18] NOH S H, KOO J, LEE S, et al. FlexBlock: A flexible DNN training accelerator with multi-mode block floating point support[J]. IEEE Transactions on Computers, 2023, 72(9): 2522-2535.
[19] ZHAO W, DANG Q, XIA T, et al. Optimizing FPGA-Based DNN accelerator with shared exponential floating-point format[J]. IEEE Transactions on Circuits and Systems I: Regular Papers, 2023, 70(11): 4478-4491.
[20] FANG C, SHI M, GEENS R, et al. Anda: Unlocking efficient LLM inference with a variable-length grouped activation data format[C]//2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 2025: 1467-1481.
[21] ZHANG S Q, MCDANEL B, KUNG H T. FAST: DNN training under variable precision block floating point with stochastic rounding[C]//2022 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2022: 846-860.
[22] LANGHAMMER M, GRIBOK S, BAECKLER G. High density 8-bit multiplier systolic arrays for FPGA[C]//2020 IEEE 28th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM). IEEE, 2020: 84-92.
[23] JANG J, KIM Y, LEE J, et al. FIGNA: Integer unit-based accelerator design for FP-INT GEMM preserving numerical accuracy[C]//2024 IEEE International Symposium on High-Performance Computer Architecture (HPCA). IEEE, 2024: 760-773.
[24] LIU S, FAN H, LUK W. Accelerating fully spectral CNNs with adaptive activation functions on FPGA[C]//2021 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2021: 1530-1535.
[25] WANG X, ZHOU Z, YUAN Z, et al. FD-CNN: A Frequency-Domain FPGA Acceleration Scheme for CNN-Based Image-Processing Applications[J]. ACM Transactions on Embedded Computing Systems, 2023, 22(6): 1-30.
[26] YANG J, YUNE S, LIM S, et al. ACane: An Efficient FPGA-based Embedded Vision Platform with Accumulation-as-Convolution Packing for Autonomous Mobile Robots[C]//2024 29th Asia and South Pacific Design Automation Conference (ASP-DAC). IEEE, 2024: 533-538.
[27] DOON R, RAWAT T K, GAUTAM S. CIFAR-10 classification using deep convolutional neural network[C]//2018 IEEE Punecon. IEEE, 2018: 1-5.
[28] DENG J, DONG W, SOCHER R, et al. ImageNet: A large-scale hierarchical image database[C]//2009 IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 2009: 248-255.
[29] HE K, ZHANG X, REN S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770-778.
[30] SIMONYAN K, ZISSERMAN A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014.
[31] PRASAD B M P, PARANE K, TALAWAR B. High-performance NoC simulation acceleration framework employing the Xilinx DSP48E1 blocks[C]//2019 International Symposium on VLSI Design, Automation and Test (VLSI-DAT). IEEE, 2019: 1-4.