
计算机工程 ›› 2023, Vol. 49 ›› Issue (10): 105-111, 119. doi: 10.19678/j.issn.1000-3428.0065685

• 人工智能与模式识别 •

丢弃冗余块的语音识别Transformer解码加速方法

赵德春1, 舒洋2, 李玲1, 陈欢1, 张子豪2   

  1. 重庆邮电大学 生物信息学院, 重庆 400065
    2. 重庆邮电大学 自动化学院, 重庆 400065
  • 收稿日期:2022-09-05 出版日期:2023-10-15 发布日期:2023-01-06
  • About the authors:

    Dechun ZHAO (born 1975), male, professor, Ph.D.; his main research interest is natural language processing

    Yang SHU, master's student

    Ling LI, master's student

    Huan CHEN, master's student

    Zihao ZHANG, master's student

  • Funding:
    Natural Science Foundation of Chongqing (cstc2019jcyj-msxmX0275); Chongqing Postgraduate Scientific Research and Innovation Project (CYS22460)

Speech Recognition Transformer Decoding Acceleration Method with Discarding Redundant Blocks

Dechun ZHAO1, Yang SHU2, Ling LI1, Huan CHEN1, Zihao ZHANG2   

  1. School of Bioinformatics, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
    2. School of Automation, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Received:2022-09-05 Online:2023-10-15 Published:2023-01-06

摘要:

Transformer及其变体因具有强大的上下文建模能力而成为语音识别领域的主流模型,它们能够取得良好的识别结果,但是其中的解码器使用带有冗余信息的全部编码器特征,导致模型的解码速度受到限制。为提高解码器效率,提出一种丢弃冗余空白块的Transformer解码加速方法DRB。以CTC/AED结构作为语音识别基本框架,利用CTC产生的尖峰序列去除编码特征中连续冗余的空白帧,减小编码输出特征的长度,降低解码器的计算量,从而提高模型的解码速度。采用预训练加微调的方式对使用DRB方法的语音识别模型进行训练,以减小因盲目对齐而产生的额外训练开销。引入Intermediate CTC结构提高模型训练时对编码器的约束能力,减小DRB判断冗余帧的误差,降低DRB方法对模型识别精度造成的损失。在开源数据集AISHELL-1与LibriSpeech上进行实验,结果表明,使用DRB的两阶段重打分非自回归解码方法在2个数据集上均能对解码速度取得58%左右的加速效果,且识别精度几乎没有损失,实现了解码效率的显著提升。

关键词: 语音识别, Transformer解码器, CTC模型, 特征压缩, 解码加速

Abstract:

Transformer and its variants have become mainstream models in speech recognition owing to their strong contextual modeling capability. Although they achieve good recognition results, their decoding speed is limited because the decoder attends to all encoder features, including redundant ones. To improve decoder efficiency, a Transformer decoding acceleration method, DRB, which discards redundant blank blocks, is proposed. Using the Connectionist Temporal Classification/Attention-based Encoder-Decoder (CTC/AED) structure as the basic speech recognition framework, the method exploits the spike sequence produced by CTC to remove consecutive redundant blank frames from the encoded features, shortening the encoder output, reducing the computational cost of the decoder, and thus accelerating decoding. The speech recognition model using DRB is trained with a pre-training-plus-fine-tuning scheme to reduce the additional training cost caused by blind alignment. An Intermediate CTC structure is introduced to strengthen the constraint on the encoder during training, which reduces the errors of DRB in identifying redundant frames and limits the loss in recognition accuracy caused by DRB. Experiments on the open-source AISHELL-1 and LibriSpeech datasets show that the two-stage rescoring non-autoregressive decoding method with DRB achieves a decoding speedup of approximately 58% on both datasets with almost no loss in recognition accuracy, a significant improvement in decoding efficiency.

Key words: speech recognition, Transformer decoder, CTC model, feature compression, decoding acceleration
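The following is a minimal, illustrative sketch of the blank-frame dropping idea described in the abstract, not the authors' DRB implementation: it assumes PyTorch tensors, a hypothetical drop_blank_frames helper, a blank index of 0, and a rule that keeps one frame per run of consecutive CTC blanks; the exact reduction rule used by DRB may differ.

```python
import torch

def drop_blank_frames(encoder_out: torch.Tensor,
                      ctc_logits: torch.Tensor,
                      blank_id: int = 0) -> torch.Tensor:
    """Illustrative DRB-style shortening of encoder features (assumption-laden sketch).

    encoder_out: (T, D) encoder output for one utterance
    ctc_logits : (T, V) frame-level CTC logits from the same encoder
    blank_id   : index of the CTC blank symbol (assumed to be 0 here)

    Returns a shortened feature sequence in which each run of consecutive
    blank frames is collapsed, so the decoder attends to fewer frames.
    """
    # Greedy frame-wise CTC labels: non-blank frames are the CTC "spikes".
    pred = ctc_logits.argmax(dim=-1)
    is_blank = pred.eq(blank_id)

    # Keep every non-blank frame; within a blank run, keep only the first frame.
    keep = torch.ones_like(is_blank)
    keep[1:] = ~(is_blank[1:] & is_blank[:-1])
    return encoder_out[keep]

# Toy usage with random tensors: 100 encoder frames of dimension 256,
# a vocabulary of 4233 tokens (both numbers are arbitrary here).
enc = torch.randn(100, 256)
logits = torch.randn(100, 4233)
reduced = drop_blank_frames(enc, logits)
print(enc.shape, "->", reduced.shape)  # reduced length depends on the blank runs
```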