Multi-model Fusion Speech Wake-up Word Detection Method Based on Ghost-SE-Res2Net

doi:10.19678/j.issn.1000-3428.0067197

Abstract

Speech Wake-up Word Detection (WWD) is a key technology in voice interaction, and the size of the detection window strongly affects WWD performance. This study proposes a new multi-model fusion method that improves WWD performance by fusing the detection results obtained with a small and a large detection window. The method comprises two classification models, one using a small detection window and one using a large detection window. Both are built on Ghost-SE-Res2Net, a lightweight variant of the Squeeze-and-Excitation Res2Net (SE-Res2Net) module, whose multi-scale mechanism significantly improves WWD performance. In Ghost-SE-Res2Net, Ghost convolution first replaces the ordinary convolution in SE-Res2Net to reduce the parameter count; an attention pooling layer then replaces the global average pooling layer to further improve WWD performance. During detection, the maximum of three consecutive detection results from the small-window model is fused with one detection result from the large-window model to decide whether the wake-up word has been triggered. During training, a hard sample mining algorithm is introduced so that the classification models selectively learn from wake-up words that are difficult to detect, which improves their detection performance. The system is evaluated on the Mobvoi dataset, which contains two wake-up words. At 0.5 false alarms per hour, it achieves false rejection rates of 0.46% and 0.43% for the two wake-up words, respectively. This performance is on par with that of the state-of-the-art baseline, while the system has 31% fewer parameters than the baseline.

Key words: Wake-up Word Detection (WWD), Ghost block, Res2Net structure, False Rejections (FR), multi-model fusion
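
The fusion step described in the abstract (the maximum of three consecutive small-window results combined with one large-window result) can be illustrated with a minimal sketch. The abstract does not state how the two values are combined, so the weighted average, the `weight` and `threshold` values, and the function names `fused_score` and `is_triggered` are all assumptions made for illustration, not the authors' method.

```python
from typing import Sequence


def fused_score(small_scores: Sequence[float], large_score: float,
                weight: float = 0.5) -> float:
    """Combine small-window and large-window wake-up word scores.

    small_scores: posteriors from three consecutive small detection windows.
    large_score:  posterior from one large detection window.
    weight:       hypothetical mixing weight (assumption, not from the paper).
    """
    if len(small_scores) != 3:
        raise ValueError("expected scores from three consecutive small windows")
    # Taking the maximum over the three small windows follows the abstract.
    small_max = max(small_scores)
    # The weighted average below is an assumed fusion operator.
    return weight * small_max + (1.0 - weight) * large_score


def is_triggered(small_scores: Sequence[float], large_score: float,
                 threshold: float = 0.5, weight: float = 0.5) -> bool:
    """Return True if the fused score reaches the (hypothetical) threshold."""
    return fused_score(small_scores, large_score, weight) >= threshold


if __name__ == "__main__":
    # One small window fires strongly and the large window agrees.
    print(is_triggered([0.10, 0.92, 0.35], 0.88))
    # Neither model is confident.
    print(is_triggered([0.10, 0.20, 0.15], 0.30))
```

In practice the threshold would be tuned to hit a target operating point, such as the 0.5 false alarms per hour reported above.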


Qiuchen YU, Ruohua ZHOU, Qingsheng YUAN. Multi-model Fusion Speech Wake-up Word Detection Method Based on Ghost-SE-Res2Net[J]. Computer Engineering, 2024, 50(3): 52-59.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0067197

https://www.ecice06.com/EN/Y2024/V50/I3/52

Figures/Tables 12

Fig.1 Multi-model fusion wake-up word detection system

Fig.2 Bottleneck module, Res2Net module, and SE-Res2Net module

Fig.3 Ghost module

Fig.4 Ghost-SE-Res2Net module

Fig.5 Structure of attention pooling layer

Fig.6 The impact of different models on multi-model fusion system

Fig.7 The impact of different pooling methods on multi-model fusion system

