
Computer Engineering ›› 2024, Vol. 50 ›› Issue (3): 52-59. doi: 10.19678/j.issn.1000-3428.0067197

• Artificial Intelligence and Pattern Recognition •

Multi-model Fusion Speech Wake-up Word Detection Method Based on Ghost-SE-Res2Net

Qiuchen YU1, Ruohua ZHOU1,*, Qingsheng YUAN2

  1. School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 102616, China
  2. National Computer Network Emergency Response Technical Team and Coordination Center, Beijing 100029, China
  • Received: 2023-03-16 Online: 2024-03-15 Published: 2023-07-06
  • Contact: Ruohua ZHOU
  • Supported by: National Natural Science Foundation of China (11590774)


Abstract:

Speech Wake-up Word Detection (WWD) is a key technology in voice interaction, and choosing an appropriate detection window size strongly affects WWD performance. This study proposes a novel multi-model fusion method that improves WWD performance by fusing the detection results of a small and a large detection window. The method comprises two classification models, one using a small detection window and one using a large detection window, both based on a lightweight Squeeze-and-Excitation Res2Net (SE-Res2Net) module, namely Ghost-SE-Res2Net; the multi-scale mechanism of the SE-Res2Net structure significantly improves WWD capability. In Ghost-SE-Res2Net, Ghost convolution first replaces the ordinary convolution in SE-Res2Net to reduce the model parameter count, and an attention pooling layer then replaces the global average pooling layer to further improve WWD capability. During detection, the maximum of the detection results from three consecutive small-detection-window models is fused with the detection result from one large-detection-window model to determine whether the wake-up word is triggered. During training, a hard sample mining algorithm is introduced to selectively learn wake-up word information that is difficult to detect, improving the detection performance of the classification models. System performance is evaluated on the Mobvoi dataset, which contains two wake-up words. Experimental results show that at 0.5 false alarms per hour, the system achieves false rejection rates of 0.46% and 0.43% on the two wake-up words, respectively, matching the performance of a state-of-the-art baseline with 31% fewer parameters.

Key words: Wake-up Word Detection (WWD), Ghost block, Res2Net structure, False Rejection (FR), multi-model fusion
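The abstract's claim that Ghost convolution reduces the parameter count can be illustrated with a simple count. The sketch below follows the general Ghost-module idea (an ordinary convolution producing a fraction of the output channels, plus cheap depthwise operations generating the remaining "ghost" maps); the channel sizes, kernel sizes, and the ratio s=2 are illustrative assumptions, not values from the paper.

```python
# Parameter-count sketch: ordinary convolution vs. a Ghost module.
# All sizes below are illustrative, not taken from the paper.

def conv_params(c_in, c_out, k):
    """Weight count of an ordinary k x k convolution (bias omitted)."""
    return c_in * c_out * k * k

def ghost_params(c_in, c_out, k, s=2, d=3):
    """Ghost module: a primary conv produces c_out/s channels, then (s-1)
    cheap d x d depthwise convolutions generate the remaining ghost maps."""
    primary = conv_params(c_in, c_out // s, k)
    cheap = (c_out // s) * (s - 1) * d * d  # depthwise: one filter per channel
    return primary + cheap

c_in, c_out, k = 64, 128, 3
print(conv_params(c_in, c_out, k))   # 73728 weights for the ordinary conv
print(ghost_params(c_in, c_out, k))  # 37440 weights for the Ghost module
```

With s=2 the Ghost module needs roughly half the weights of the ordinary convolution it replaces, which is the mechanism behind the lighter model the abstract describes.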

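The detection-time fusion rule described in the abstract (the maximum over three consecutive small-window scores is fused with one large-window score) can be sketched as below. The abstract does not specify the fusion operator or the decision threshold, so the simple averaging step and the 0.5 threshold are illustrative assumptions, not the paper's exact procedure.

```python
# Sketch of the multi-model window-fusion decision rule from the abstract.
# The averaging fusion and the threshold value are assumptions for illustration.

def fuse_scores(small_window_scores, large_window_score, threshold=0.5):
    """Return True if the wake-up word is considered triggered.

    small_window_scores: posteriors from 3 consecutive small-window model runs
    large_window_score:  posterior from the large-window model
    """
    assert len(small_window_scores) == 3, "expects 3 consecutive small-window scores"
    small_peak = max(small_window_scores)            # max over consecutive windows
    fused = 0.5 * (small_peak + large_window_score)  # assumed fusion: simple average
    return fused >= threshold

# Strong evidence in one small window plus a confident large-window score triggers:
print(fuse_scores([0.2, 0.9, 0.3], 0.7))  # True
# Uniformly weak scores do not:
print(fuse_scores([0.1, 0.2, 0.1], 0.3))  # False
```

Taking the maximum over consecutive small windows makes the trigger robust to the wake-up word not being centered in any single window, while the large-window score supplies longer-span context before a final decision.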