
Computer Engineering ›› 2023, Vol. 49 ›› Issue (5): 73-80. doi: 10.19678/j.issn.1000-3428.0064687

• Artificial Intelligence and Pattern Recognition •

  • About the authors: LI Yiting (born 1993), male, M.S. candidate; his research interests include speech recognition and model compression. QU Dan (corresponding author), professor, Ph.D.; YANG Xukui, lecturer, Ph.D.; ZHANG Hao, Ph.D. candidate; SHEN Xiaolong, M.S. candidate.
  • Funding:
    National Natural Science Foundation of China (62171470); Central Plains Science and Technology Innovation Leading Talent Program of Henan Province (234200510019); General Program of the Natural Science Foundation of Henan Province (232300421240).

Efficient Conformer Model Based on Factorized Gated Attention Unit

LI Yiting, QU Dan, YANG Xukui, ZHANG Hao, SHEN Xiaolong   

  1. College of Information Systems Engineering, PLA Strategic Support Force Information Engineering University, Zhengzhou 450001, China
  • Received:2022-05-13 Revised:2022-06-21 Published:2022-08-31



Abstract: To reduce the number of model parameters and accelerate training and recognition while preserving the accuracy of the Conformer end-to-end speech recognition model under limited storage and computing resources, an efficient Conformer model based on a Factorized Gated Attention Unit (FGAU) and low-rank decomposition is proposed. In the feedforward and convolution modules, low-rank decomposition accelerates computation and improves the generalization ability of the Conformer model. In the self-attention module, the FGAU reduces the computational complexity of attention, and a cosine weighting mechanism is introduced to concentrate the gated attention on neighboring positions, improving recognition accuracy. Experimental results on the AISHELL-1 dataset indicate that, after introducing the FGAU and cosine weighting, both the parameter count and the speech recognition Character Error Rate (CER) of the proposed model decrease significantly. In particular, when the number of parameters is compressed to 50% of that of the Conformer end-to-end baseline, the CER increases by only 0.34 percentage points, indicating that the proposed model achieves lower computational complexity while maintaining high speech recognition accuracy.
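The low-rank decomposition mentioned in the abstract replaces a dense weight matrix W of shape (d_out, d_in) with two factors U (d_out, r) and V (r, d_in), cutting the parameter count from d_out·d_in to r·(d_out + d_in). A minimal sketch of the parameter arithmetic, assuming typical Conformer feedforward dimensions (d_model = 256 expanded to 1024; the rank r = 64 is a hypothetical choice, not a value from the paper):

```python
# Parameter counts for a dense layer versus its rank-r factorization.
# W (d_out x d_in) is approximated as U @ V, with U (d_out x r) and
# V (r x d_in), so the factorization pays off whenever
# r * (d_out + d_in) < d_out * d_in.

def dense_params(d_in: int, d_out: int) -> int:
    """Parameters of a full dense weight matrix."""
    return d_out * d_in

def lowrank_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters of the rank-r factorization U @ V."""
    return d_out * r + r * d_in

d_in, d_out, r = 256, 1024, 64
print(dense_params(d_in, d_out))       # 262144
print(lowrank_params(d_in, d_out, r))  # 81920
```

The rank r trades compression against approximation error; the abstract's reported setting compresses the whole model to roughly half the baseline's parameters.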

Key words: end-to-end speech recognition, Conformer model, Factorized Gated Attention Unit (FGAU), model compression, low-rank decomposition
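The cosine weighting mechanism described in the abstract concentrates the gated attention on positions near the query. The paper's exact formulation is not given here; one plausible form is a cosine window over the position offset |i − j| that equals 1 on the diagonal and decays to 0 beyond a width parameter (the name `tau` below is hypothetical, not from the paper):

```python
import math

def cosine_weights(n: int, tau: float = 8.0) -> list[list[float]]:
    """Build an n x n weight matrix that decays with distance |i - j|:
    full weight on the diagonal, zero beyond tau positions away.
    A sketch of a cosine-window locality bias, not the paper's code."""
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            d = abs(i - j)
            # cos(pi*d / (2*tau)) is 1 at d=0 and reaches 0 at d=tau
            w[i][j] = math.cos(math.pi * d / (2 * tau)) if d < tau else 0.0
    return w

w = cosine_weights(6, tau=4.0)
# w[0][0] is 1.0; weights shrink monotonically as |i - j| grows.
```

In use, such a matrix would multiply (or bias) the attention scores elementwise before they gate the value projection, biasing the model toward local context.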
