
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 73-80. doi: 10.19678/j.issn.1000-3428.0064687

• Artificial Intelligence and Pattern Recognition •

  • About the authors: LI Yiting (born 1993), male, M.S. candidate; his research interests include speech recognition and model compression. QU Dan (corresponding author), professor, Ph.D.; YANG Xukui, lecturer, Ph.D.; ZHANG Hao, Ph.D. candidate; SHEN Xiaolong, M.S. candidate.
  • Funding:
    National Natural Science Foundation of China (62171470); Central Plains Science and Technology Innovation Leading Talent Program of Henan Province (234200510019); General Program of the Natural Science Foundation of Henan Province (232300421240).

Efficient Conformer Model Based on Factorized Gated Attention Unit

LI Yiting, QU Dan, YANG Xukui, ZHANG Hao, SHEN Xiaolong   

  1. College of Information Systems Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
  • Received:2022-05-13 Revised:2022-06-21 Published:2022-08-31



Abstract: To reduce the number of model parameters and accelerate training and recognition while preserving the accuracy of the Conformer end-to-end speech recognition model under limited storage and computing resources, an efficient Conformer model based on a Factorized Gated Attention Unit (FGAU) and low-rank decomposition is proposed. In the feedforward and convolution modules, low-rank decomposition accelerates computation and improves the generalization ability of the Conformer model. In the self-attention module, the FGAU reduces the computational complexity of attention, and a cosine weighting mechanism is introduced to concentrate the gated attention on neighboring positions, improving recognition accuracy. Experimental results on the AISHELL-1 dataset indicate that, after introducing the FGAU and cosine weighting, both the parameter count and the speech recognition Character Error Rate (CER) of the proposed model decrease significantly. In particular, when the number of parameters is compressed to 50% of that of the Conformer end-to-end baseline, the CER increases by only 0.34 percentage points, indicating that the proposed model achieves lower computational complexity while maintaining high speech recognition accuracy.
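The low-rank decomposition mentioned in the abstract replaces a dense weight matrix W of shape (d_out, d_in) with two factors U (d_out, r) and V (r, d_in), cutting the parameter count from d_out·d_in to r·(d_out + d_in). A minimal sketch of the parameter arithmetic, assuming typical Conformer feedforward dimensions (d_model = 256 expanded to 1024; the rank r = 64 is a hypothetical choice, not a value from the paper):

```python
# Parameter counts for a dense layer versus its rank-r factorization.
# W (d_out x d_in) is approximated as U @ V, with U (d_out x r) and
# V (r x d_in), so the factorization pays off whenever
# r * (d_out + d_in) < d_out * d_in.

def dense_params(d_in: int, d_out: int) -> int:
    """Parameters of a full dense weight matrix."""
    return d_out * d_in

def lowrank_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters of the rank-r factorization U @ V."""
    return d_out * r + r * d_in

d_in, d_out, r = 256, 1024, 64
print(dense_params(d_in, d_out))       # 262144
print(lowrank_params(d_in, d_out, r))  # 81920
```

The rank r trades compression against approximation error; the abstract's reported setting compresses the whole model to roughly half the baseline's parameters.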

Key words: end-to-end speech recognition, Conformer model, Factorized Gated Attention Unit (FGAU), model compression, low-rank decomposition
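The cosine weighting mechanism described in the abstract concentrates the gated attention on positions near the query. The paper's exact formulation is not given here; one plausible form is a cosine window over the position offset |i − j| that equals 1 on the diagonal and decays to 0 beyond a width parameter (the name `tau` below is hypothetical, not from the paper):

```python
import math

def cosine_weights(n: int, tau: float = 8.0) -> list[list[float]]:
    """Build an n x n weight matrix that decays with distance |i - j|:
    full weight on the diagonal, zero beyond tau positions away.
    A sketch of a cosine-window locality bias, not the paper's code."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            # cos(pi*d / (2*tau)) is 1 at d=0 and reaches 0 at d=tau
            w[i][j] = math.cos(math.pi * d / (2 * tau)) if d < tau else 0.0
    return w

w = cosine_weights(6, tau=4.0)
# w[0][0] is 1.0; weights shrink monotonically as |i - j| grows.
```

In use, such a matrix would multiply (or bias) the attention scores elementwise before they gate the value projection, biasing the model toward local context.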
