
Computer Engineering (计算机工程)



Efficiency Optimization Method for the Bi-LSTM Operator Targeting LUNA Chips

  • Published: 2025-08-12

Abstract: Efficiency optimization of deep learning models is a key research focus in applied artificial intelligence. When deploying a deep learning model, efficiency gains can be obtained by reducing operator scheduling overhead and raising operator execution efficiency. This paper targets the Bi-directional Long Short-Term Memory (Bi-LSTM) structure widely used in temporal networks. Exploiting the fact that the forward and backward Long Short-Term Memory (LSTM) cells in this structure consume the same input sequence and can therefore reuse it, and combining operator fusion with tensor computation merging, the paper proposes an efficiency optimization method for the Bi-LSTM operator on the LUNA chip. The method reduces time overhead and improves the execution efficiency of the Bi-LSTM operator by eliminating redundant operations, reusing data, and merging tensor computations, and it also generalizes to other temporal network operators such as Bi-RNN and Bi-GRU. An experimental platform is built on the domestically produced edge-side LUNA chip to validate the method. Experimental results show that the proposed Bi-LSTM efficiency optimization method achieves an improvement of up to 37.6%.
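The abstract names its two concrete techniques, input reuse and tensor computation merging, only at a high level. The sketch below is a purely illustrative NumPy rendering of that idea, assuming a standard LSTM gate layout: because both directions of a Bi-LSTM read the same input sequence, their input-to-gate projections can be merged into one large matrix multiplication instead of two per time step. All identifiers (bilstm_input_projection_merged, W_fwd, W_bwd, etc.) are hypothetical; the paper's LUNA-specific operator fusion and scheduling are not reproduced here.

```python
import numpy as np

def bilstm_input_projection_merged(x, W_fwd, W_bwd, b_fwd, b_bwd):
    """Illustrative sketch (hypothetical names): both LSTM directions of a
    Bi-LSTM read the same input sequence x, so their input-to-gate
    projections can be merged into one GEMM over the whole sequence
    instead of 2*T per-timestep matrix multiplications.

    x:      (T, D)   input sequence shared by both directions
    W_fwd:  (D, 4H)  stacked gate weights (i, f, g, o) of the forward cell
    W_bwd:  (D, 4H)  stacked gate weights of the backward cell
    """
    # Merge the two directions' weights along the output axis, then
    # project the whole sequence at once: a single (T, D) x (D, 8H) GEMM.
    W = np.concatenate([W_fwd, W_bwd], axis=1)   # (D, 8H)
    b = np.concatenate([b_fwd, b_bwd])           # (8H,)
    proj = x @ W + b                             # (T, 8H)
    four_h = W_fwd.shape[1]
    # Split back into the forward and backward halves.
    return proj[:, :four_h], proj[:, four_h:]

# Usage: only the recurrent (hidden-to-hidden) part still runs per time
# step and per direction; the dominant input GEMM is computed once.
T, D, H = 16, 32, 64
x = np.random.randn(T, D)
W_f, W_b = np.random.randn(D, 4 * H), np.random.randn(D, 4 * H)
b_f, b_b = np.zeros(4 * H), np.zeros(4 * H)
proj_f, proj_b = bilstm_input_projection_merged(x, W_f, W_b, b_f, b_b)
assert proj_f.shape == (T, 4 * H) and proj_b.shape == (T, 4 * H)
```

The same merging applies unchanged to Bi-RNN and Bi-GRU cells, whose input projections differ only in the number of stacked gates.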