
计算机工程 ›› 2023, Vol. 49 ›› Issue (10): 105-111, 119. doi: 10.19678/j.issn.1000-3428.0065685

• 人工智能与模式识别 •

丢弃冗余块的语音识别Transformer解码加速方法

赵德春1, 舒洋2, 李玲1, 陈欢1, 张子豪2   

  1. 重庆邮电大学 生物信息学院, 重庆 400065
    2. 重庆邮电大学 自动化学院, 重庆 400065
  • 收稿日期:2022-09-05 出版日期:2023-10-15 发布日期:2023-01-06
  • About the authors:

    Dechun ZHAO (born 1975), male, professor, Ph.D.; his main research interest is natural language processing

    Yang SHU, master's student

    Ling LI, master's student

    Huan CHEN, master's student

    Zihao ZHANG, master's student

  • Funding:
    Natural Science Foundation of Chongqing (cstc2019jcyj-msxmX0275); Chongqing Postgraduate Scientific Research and Innovation Project (CYS22460)

Speech Recognition Transformer Decoding Acceleration Method with Discarding Redundant Blocks

Dechun ZHAO1, Yang SHU2, Ling LI1, Huan CHEN1, Zihao ZHANG2   

  1. School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received:2022-09-05 Online:2023-10-15 Published:2023-01-06

摘要:

Transformer及其变体因具有强大的上下文建模能力而成为语音识别领域的主流模型,它们能够取得良好的识别结果,但是其中的解码器使用带有冗余信息的全部编码器特征,导致模型的解码速度受到限制。为提高解码器效率,提出一种丢弃冗余空白块的Transformer解码加速方法DRB。以CTC/AED结构作为语音识别基本框架,利用CTC产生的尖峰序列去除编码特征中连续冗余的空白帧,减小编码输出特征的长度,降低解码器的计算量,从而提高模型的解码速度。采用预训练加微调的方式对使用DRB方法的语音识别模型进行训练,以减小因盲目对齐而产生的额外训练开销。引入Intermediate CTC结构提高模型训练时对编码器的约束能力,减小DRB判断冗余帧的误差,降低DRB方法对模型识别精度造成的损失。在开源数据集AISHELL-1与LibriSpeech上进行实验,结果表明,使用DRB的两阶段重打分非自回归解码方法在2个数据集上均能对解码速度取得58%左右的加速效果,且识别精度几乎没有损失,实现了解码效率的显著提升。

关键词: 语音识别, Transformer解码器, CTC模型, 特征压缩, 解码加速

Abstract:

Transformer and its variants have become mainstream models in speech recognition owing to their strong contextual modeling capability. Although they achieve good recognition results, their decoding speed is limited because the decoder attends to all encoder features, including redundant ones. To improve decoder efficiency, a Transformer decoding acceleration method, DRB, which discards redundant blank blocks, is proposed. Using the Connectionist Temporal Classification/Attention-based Encoder-Decoder (CTC/AED) structure as the basic speech recognition framework, the method exploits the spike sequence produced by CTC to remove consecutive redundant blank frames from the encoded features, shortening the encoder output, reducing the computational cost of the decoder, and thus accelerating decoding. The speech recognition model using DRB is trained with a pre-training-plus-fine-tuning scheme to reduce the additional training cost caused by blind alignment. An Intermediate CTC structure is introduced to strengthen the constraint on the encoder during training, which reduces the errors of DRB in identifying redundant frames and limits the loss in recognition accuracy caused by DRB. Experiments on the open-source AISHELL-1 and LibriSpeech datasets show that the two-stage rescoring non-autoregressive decoding method with DRB achieves a decoding speedup of approximately 58% on both datasets with almost no loss in recognition accuracy, a significant improvement in decoding efficiency.

Key words: speech recognition, Transformer decoder, CTC model, feature compression, decoding acceleration
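The following is a minimal, illustrative sketch of the blank-frame dropping idea described in the abstract, not the authors' DRB implementation: it assumes PyTorch tensors, a hypothetical drop_blank_frames helper, a blank index of 0, and a rule that keeps one frame per run of consecutive CTC blanks; the exact reduction rule used by DRB may differ.

```python
import torch

def drop_blank_frames(encoder_out: torch.Tensor,
                      ctc_logits: torch.Tensor,
                      blank_id: int = 0) -> torch.Tensor:
    """Illustrative DRB-style shortening of encoder features (assumption-laden sketch).

    encoder_out: (T, D) encoder output for one utterance
    ctc_logits : (T, V) frame-level CTC logits from the same encoder
    blank_id   : index of the CTC blank symbol (assumed to be 0 here)

    Returns a shortened feature sequence in which each run of consecutive
    blank frames is collapsed, so the decoder attends to fewer frames.
    """
    # Greedy frame-wise CTC labels: non-blank frames are the CTC "spikes".
    pred = ctc_logits.argmax(dim=-1)
    is_blank = pred.eq(blank_id)

    # Keep every non-blank frame; within a blank run, keep only the first frame.
    keep = torch.ones_like(is_blank)
    keep[1:] = ~(is_blank[1:] & is_blank[:-1])
    return encoder_out[keep]

# Toy usage with random tensors: 100 encoder frames of dimension 256,
# a vocabulary of 4233 tokens (both numbers are arbitrary here).
enc = torch.randn(100, 256)
logits = torch.randn(100, 4233)
reduced = drop_blank_frames(enc, logits)
print(enc.shape, "->", reduced.shape)  # reduced length depends on the blank runs
```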