
Computer Engineering ›› 2023, Vol. 49 ›› Issue (12): 55-62. doi: 10.19678/j.issn.1000-3428.0066172

• Frontiers in Computer Systems •

Design of Sparse CNN Accelerator Based on Inter-Frame Data Reuse

Qirun HONG, Qin WANG   

  1. School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2022-11-04  Online: 2023-12-15  Published: 2023-02-08

  • About the authors:

    HONG Qirun (b. 1996), male, M.S. candidate; his main research interest is neural network accelerator design.

    WANG Qin, research fellow, Ph.D.

  • Supported by:
    National Key Research and Development Program of China (2018YFA0701500)

Abstract:

Convolutional Neural Networks (CNNs) are widely used for object detection and other tasks in video applications. However, conventional CNN accelerators only accelerate single-image inference and do not exploit the data redundancy between successive video frames. Existing CNN accelerators that use inter-frame data reuse suffer from low sparsity, large model size, and high computational complexity. To address these problems, a learned step-size low-precision quantization method is proposed to increase the sparsity of differential frames, and a power-of-two constraint on the quantization scale factors is introduced to make the quantization hardware friendly. The design also uses the Winograd algorithm to reduce the computational complexity of the convolution operator, and on this basis proposes an input-channel bitmap compression scheme that exploits the sparsity of both activations and weights to skip all zero-valued computations. Based on the YOLOv3-tiny network, the proposed quantization method and sparse CNN accelerator are verified on a Field Programmable Gate Array (FPGA) platform using a subset of the ImageNet ILSVRC2015 VID dataset and the DAC2020 dataset. The results show that the proposed quantization method achieves 4-bit full-integer quantization with a loss of less than 2% in mean Average Precision (mAP). Owing to inter-frame data reuse, the designed sparse CNN accelerator achieves a performance of 814.2×10⁹ operations/s and an energy efficiency of 201.1×10⁹ operations/s/W. Compared with other FPGA-based accelerators, it delivers 1.77 to 8.99 times higher performance and 1.91 to 5.56 times higher energy efficiency.
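
To make the first two ideas concrete, here is a minimal NumPy sketch of quantizing a differential frame with a step size constrained to a power of two. It is an illustration under assumptions rather than the authors' implementation: the step-size value, tensor shapes, and symmetric clipping range are invented for the example, and in the actual method the step size would be learned during training (LSQ-style) rather than fixed.

    import numpy as np

    def pow2_scale(s):
        # Constrain a learned step size to the nearest power of two,
        # so rescaling in hardware reduces to a bit shift.
        return 2.0 ** np.round(np.log2(s))

    def quantize(x, s, bits=4):
        # Uniform symmetric quantization to `bits`-bit integers.
        qmax = 2 ** (bits - 1) - 1                   # 7 for 4-bit
        return np.clip(np.round(x / s), -qmax - 1, qmax).astype(np.int8)

    # Quantize the frame-to-frame difference instead of the raw frame:
    # small changes round to zero, so the quantized differential frame
    # is highly sparse and those multiply-accumulates can be skipped.
    rng = np.random.default_rng(0)
    prev = rng.standard_normal((8, 8)).astype(np.float32)
    curr = prev + 0.01 * rng.standard_normal((8, 8)).astype(np.float32)

    s = pow2_scale(0.1)          # assumed value; the method learns it
    diff_q = quantize(curr - prev, s)
    print("differential-frame sparsity:", float(np.mean(diff_q == 0)))

Because the scale is a power of two, rescaling between layers reduces to bit shifts, which is what makes the quantization hardware friendly.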
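
The Winograd step can be illustrated with the smallest one-dimensional instance, F(2,3), which produces two outputs of a 3-tap convolution using four multiplications instead of six. The accelerator itself presumably uses a two-dimensional variant whose tile size the abstract does not state; the matrices below are the standard F(2,3) transforms.

    import numpy as np

    # Standard Winograd F(2,3) transforms: Y = A^T [(G g) * (B^T d)]
    BT = np.array([[1,  0, -1,  0],
                   [0,  1,  1,  0],
                   [0, -1,  1,  0],
                   [0,  1,  0, -1]], dtype=np.float32)
    G  = np.array([[1.0,  0.0, 0.0],
                   [0.5,  0.5, 0.5],
                   [0.5, -0.5, 0.5],
                   [0.0,  0.0, 1.0]], dtype=np.float32)
    AT = np.array([[1, 1,  1,  0],
                   [0, 1, -1, -1]], dtype=np.float32)

    def winograd_f23(d, g):
        # d: 4-sample input tile, g: 3-tap filter -> 2 outputs,
        # using only 4 element-wise multiplies (the Hadamard product).
        return AT @ ((G @ g) * (BT @ d))

    d = np.array([1., 2., 3., 4.], dtype=np.float32)
    g = np.array([1., 0., -1.], dtype=np.float32)
    print(winograd_f23(d, g))                     # [-2. -2.]
    print(np.convolve(d, g[::-1], mode="valid"))  # direct conv, matches

The element-wise product stage is also where sparsity pays off: an input tile of the differential frame that is entirely zero transforms to zero, so all of its multiplies can be skipped.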
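
Finally, one plausible reading of the input-channel bitmap compression, sketched from the abstract alone: one bit per input channel flags a nonzero entry, only the nonzero values are stored, and a multiply-accumulate is issued only where the activation bitmap and the weight bitmap are both set. The function names and packing format below are hypothetical.

    import numpy as np

    def bitmap_compress(vec):
        # Hypothetical scheme: 1 bit per input channel marks a nonzero
        # entry; only the nonzero values themselves are stored.
        mask = vec != 0
        return np.packbits(mask), vec[mask]

    def bitmap_decompress(bitmap, values, n):
        # Rebuild the dense per-channel vector from bitmap + values.
        mask = np.unpackbits(bitmap, count=n).astype(bool)
        out = np.zeros(n, dtype=values.dtype)
        out[mask] = values
        return out

    # A sparse slice of a quantized differential frame, 16 input channels:
    act = np.array([0, 3, 0, 0, -2, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 5],
                   dtype=np.int8)
    bm, vals = bitmap_compress(act)
    assert np.array_equal(bitmap_decompress(bm, vals, act.size), act)
    # Zero skipping across BOTH operands: a MAC is needed only where the
    # activation bitmap AND the weight bitmap are set (bitwise AND).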

Key words: Convolutional Neural Network (CNN), low-precision quantization, inter-frame data reuse, Winograd algorithm, accelerator, Field Programmable Gate Array (FPGA)
