Abstract: To address the Out-of-Vocabulary (OOV) problem in spoken keyword retrieval, this paper proposes a sub-word set construction algorithm based on Maximum Mutual Information and Minimum Description Length (MMI-MDL). Candidate pairs are selected according to the mutual information of sub-word pairs, and the MDL criterion decides whether each pair is merged into a new sub-word. The resulting sub-word set is then used to map words into sub-word sequences for retrieval. Experimental results show that, compared with the existing MDL-based sub-word set construction algorithm, the sub-word set obtained by MMI-MDL considerably improves retrieval performance: at the same precision, the OOV recall rate is 12.1% higher in relative terms than with the MDL algorithm.
Key words:
Out-of-Vocabulary (OOV),
speech retrieval,
sub-word,
Minimum Description Length (MDL),
Maximum Mutual Information (MMI),
word lattice
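The select-then-test loop summarized in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the paper's exact formulas, so everything below is an assumption made for illustration: pointwise mutual information over adjacent units stands in for the mutual information criterion, a two-part code (unigram-coded corpus plus a spelled-out lexicon) stands in for the description length, and the loop stops at the first rejected candidate. The function names `pair_pmi`, `merge_pair`, `description_length`, and `build_subwords` are hypothetical.

```python
import math
from collections import Counter


def pair_pmi(corpus):
    """Pointwise mutual information (bits) of every adjacent unit pair."""
    units = Counter(u for seq in corpus for u in seq)
    pairs = Counter(p for seq in corpus for p in zip(seq, seq[1:]))
    n_units = sum(units.values())
    n_pairs = sum(pairs.values())
    return {
        (a, b): math.log2(
            (c / n_pairs) / ((units[a] / n_units) * (units[b] / n_units))
        )
        for (a, b), c in pairs.items()
    }


def merge_pair(seq, pair):
    """Replace each left-to-right occurrence of `pair` with one merged unit."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def description_length(corpus):
    """Assumed two-part code length in bits: corpus cost plus lexicon cost."""
    counts = Counter(u for seq in corpus for u in seq)
    total = sum(counts.values())
    # Corpus cost: each unit token coded with its unigram probability.
    corpus_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Lexicon cost: each unit spelled out in base symbols plus an end marker.
    alphabet = {ch for u in counts for ch in u}
    bits_per_symbol = math.log2(len(alphabet) + 1)
    lexicon_bits = sum(len(u) + 1 for u in counts) * bits_per_symbol
    return corpus_bits + lexicon_bits


def build_subwords(words, max_merges=50):
    """Greedy loop: pick the highest-PMI pair, keep the merge only if DL drops."""
    corpus = [list(w) for w in words]  # start from single-character units
    for _ in range(max_merges):
        pmi = pair_pmi(corpus)
        if not pmi:
            break
        best = max(pmi, key=pmi.get)
        candidate = [merge_pair(seq, best) for seq in corpus]
        # Simplification: stop at the first candidate that does not shorten the code.
        if description_length(candidate) >= description_length(corpus):
            break
        corpus = candidate
    return corpus
```

Calling `build_subwords(["banana", "bandana", "cabana"])` returns one unit sequence per input word; concatenating the units of a sequence always reproduces the original word, which is the property that lets an out-of-vocabulary query be mapped to sub-word units for retrieval.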
CLC number:
YANG Le, WU Ji, LV Ping. Construction Algorithm of Sub-word Unit in Speech Retrieval[J]. Computer Engineering, 2012, 38(24): 251-253. (in Chinese)