
Computer Engineering (计算机工程) ›› 2026, Vol. 52 ›› Issue (1): 105-115. doi: 10.19678/j.issn.1000-3428.0070055

• Computational Intelligence and Pattern Recognition •

融合最大池化的Conformer中文语音识别

胡从刚1,2, 杨立鹏3, 孙永奇1,2,*, 陈华龙3, 韩可可3

  1. 北京交通大学先进轨道交通自主运行全国重点实验室, 北京 100044
  2. 北京交通大学计算机科学与技术学院, 北京 100044
  3. 中国铁道科学研究院集团有限公司, 北京 100081
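  • Received: 2024-07-01 Revised: 2024-08-09 Online: 2026-01-15 Published: 2026-01-15
  • Corresponding author: SUN Yongqi
  • About the authors:

    HU Conggang (CCF student member), male, master's degree; main research interests: speech recognition and artificial intelligence

    YANG Lipeng, associate researcher, Ph.D.

    SUN Yongqi (corresponding author), professor, Ph.D., doctoral supervisor

    CHEN Hualong, assistant researcher

    HAN Keke, assistant researcher

  • Funding:
    Fundamental Research Funds for the Central Universities (2024JBGP008); National Science and Technology Major Project for New Generation Artificial Intelligence (2021ZD0113002)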

Chinese Speech Recognition Using Conformer Fused with Max Pooling

HU Conggang1,2, YANG Lipeng3, SUN Yongqi1,2,*, CHEN Hualong3, HAN Keke3

  1. State Key Laboratory of Advanced Rail Autonomous Operation, Beijing Jiaotong University, Beijing 100044, China
  2. School of Computer Science and Technology, Beijing Jiaotong University, Beijing 100044, China
  3. China Academy of Railway Sciences Group Co., Ltd., Beijing 100081, China
  • Received: 2024-07-01 Revised: 2024-08-09 Online: 2026-01-15 Published: 2026-01-15
  • Contact: SUN Yongqi

Abstract:

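Speech recognition aims to give machines the ability to understand human speech through advanced algorithms and signal processing techniques, making communication between humans and machines more convenient and fluent. At present, most research on end-to-end speech recognition centers on optimizing the Conformer model. To address the Conformer encoder's insufficient ability to extract fine-grained local speech features, this paper proposes a Conformer-based Chinese speech recognition model fused with Max Pooling (MP). First, the output of the gated linear unit in the encoder's convolution module is max-pooled along the time dimension to extract fine-grained local features, in which multiple speech frames correspond to a single character. Then, the pooled features are fused by element-wise addition with the coarse-grained local features extracted by Depthwise Convolution (DWC), which enriches the local speech features and thereby improves the recognition accuracy of the Conformer model. Finally, experiments on the public Chinese dataset Aishell-1 show that, with greedy search decoding, the proposed model reduces the Character Error Rate (CER) of the baseline model from 5.58% to 5.32%; with attention rescoring decoding, it reduces the baseline CER from 5.06% to 4.92%.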

Key words: speech recognition, fine-grained local feature, Conformer model, max pooling, depthwise convolution
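The fusion step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the function names, the kernel sizes, the stride of 1, the edge handling, and the choice to run the pooling branch in parallel with the depthwise convolution on the same gated-linear-unit output are all assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]  # features laid out as [time][channel]


def max_pool_time(x: Matrix, kernel: int = 3) -> Matrix:
    """Max pooling along time, stride 1, output length equal to input length.

    Assumption: the window is centred on each frame and truncated at the edges.
    """
    frames, channels = len(x), len(x[0])
    half = kernel // 2
    out = []
    for t in range(frames):
        lo, hi = max(0, t - half), min(frames, t + half + 1)
        out.append([max(x[s][c] for s in range(lo, hi)) for c in range(channels)])
    return out


def depthwise_conv_time(x: Matrix, weights: Matrix) -> Matrix:
    """Depthwise 1-D convolution along time with zero padding and stride 1.

    weights[c] is the filter for channel c only; channels are never mixed.
    """
    frames, channels = len(x), len(x[0])
    kernel = len(weights[0])
    half = kernel // 2
    out = [[0.0] * channels for _ in range(frames)]
    for t in range(frames):
        for c in range(channels):
            acc = 0.0
            for k in range(kernel):
                s = t + k - half
                if 0 <= s < frames:  # frames outside the sequence count as zero
                    acc += weights[c][k] * x[s][c]
            out[t][c] = acc
    return out


def fuse_local_features(glu_out: Matrix, weights: Matrix, pool_kernel: int = 3) -> Matrix:
    """Element-wise sum of the max-pooled branch and the depthwise-conv branch."""
    pooled = max_pool_time(glu_out, pool_kernel)
    conv = depthwise_conv_time(glu_out, weights)
    return [[p + q for p, q in zip(p_row, q_row)] for p_row, q_row in zip(pooled, conv)]


# Toy input: 3 frames, 2 channels; identity filters make the conv branch return its input.
glu_out = [[1.0, 0.0], [3.0, -1.0], [2.0, 4.0]]
identity = [[0.0, 1.0, 0.0], [0.0, 1.0, 0.0]]
fused = fuse_local_features(glu_out, identity)
print(fused)  # [[4.0, 0.0], [6.0, 3.0], [5.0, 8.0]]
```

Because both branches keep the time length unchanged in this sketch, the element-wise sum needs no reshaping; a real model would have to pick padding and stride so that the two outputs align in the same way.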

Abstract:

Speech recognition technology enables machines to understand human speech using advanced algorithms and signal processing techniques, thereby making communication between humans and machines more convenient. Most existing studies on end-to-end speech recognition focus on optimizing the Conformer model. However, the Conformer encoder extracts fine-grained local speech features insufficiently. To address this issue, this study proposes a Conformer-based Chinese speech recognition model fused with Max Pooling (MP). First, the output of the gated linear unit in the convolution module of the encoder is max-pooled along the time dimension to extract fine-grained local features, in which multiple speech frames correspond to a single character. Second, these pooled features are fused by element-wise summation with the coarse-grained local features extracted via Depthwise Convolution (DWC), which enriches the local speech features and improves the recognition accuracy of the Conformer model. Experimental results on the public Chinese dataset Aishell-1 show that the proposed model reduces the Character Error Rate (CER) of the baseline model from 5.58% to 5.32% with greedy search decoding and from 5.06% to 4.92% with attention rescoring decoding.

Key words: speech recognition, fine-grained local feature, Conformer model, Max Pooling (MP), Depthwise Convolution (DWC)
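For readers unfamiliar with the metric reported in the abstract, the following sketch computes CER in its standard form: the character-level edit distance between a reference and a hypothesis, divided by the reference length. It also converts the reported absolute CER values into relative reductions. The function names and the sample sentence are hypothetical, and this is not the evaluation script used in the paper.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum substitutions, deletions, and insertions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # delete r from the reference
                           cur[j - 1] + 1,           # insert h into the reference
                           prev[j - 1] + (r != h)))  # substitute, or match at no cost
        prev = cur
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(baseline: float, improved: float) -> float:
    """Fraction of the baseline error removed by the improved model."""
    return (baseline - improved) / baseline


# Hypothetical example: one substitution (气 -> 汽) and one deletion (很), 6 reference characters.
print(round(cer("今天天气很好", "今天天汽好"), 4))        # 0.3333

# Relative reductions implied by the CER values reported in the abstract.
print(round(100 * relative_reduction(5.58, 5.32), 2))  # 4.66 (greedy search)
print(round(100 * relative_reduction(5.06, 4.92), 2))  # 2.77 (attention rescoring)
```

The relative figures are simple arithmetic on the reported numbers: the absolute gains of 0.26 and 0.14 percentage points correspond to roughly 4.7% and 2.8% of the baseline errors.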