Computer Engineering, 2012, Vol. 38, Issue (24): 251-253. doi: 10.3969/j.issn.1000-3428.2012.24.059

• Development Research and Design Technology •

Construction Algorithm of Sub-word Unit in Speech Retrieval

YANG Le, WU Ji, LV Ping

  1. (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)
  • Received: 2012-03-01 Revised: 2012-04-27 Online: 2012-12-20 Published: 2012-12-18
  • About the authors: YANG Le (born 1988), female, master's student, main research interest: speech recognition; WU Ji, associate professor, Ph.D., doctoral supervisor; LV Ping, associate research fellow, Ph.D.
  • Supported by:

    National Natural Science Foundation of China (61170197); Tsinghua University Initiative Scientific Research Program (2011thz0)

Abstract: To address the out-of-vocabulary (OOV) problem in spoken keyword retrieval, this paper proposes an algorithm for constructing a sub-word unit set based on Maximum Mutual Information and Minimum Description Length (MMI-MDL). The algorithm selects candidate pairs according to the mutual information of sub-word pairs and uses the MDL criterion to decide whether each pair is merged into a new sub-word. The resulting sub-word set is then used to map words into sub-word sequences for retrieval. Experimental results show that, compared with the existing MDL-based construction algorithm, the sub-word set obtained by MMI-MDL substantially improves retrieval performance: at the same precision, the OOV recall rate is 12.1% higher in relative terms than with the MDL algorithm.

Key words: Out-of-Vocabulary (OOV), speech retrieval, sub-word, Minimum Description Length (MDL), Maximum Mutual Information (MMI), word lattice network
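
The procedure outlined in the abstract (pick a candidate pair by mutual information, then let an MDL test decide whether to merge it) can be illustrated with a minimal Python sketch. The abstract gives no formulas, so everything concrete here is an assumption made for illustration: pointwise mutual information over adjacent units as the selection score, a two-part code (unit inventory cost plus unigram corpus cost) as the description length, characters as the initial units, and the hypothetical names `description_length`, `best_pair_by_mi`, `merge_pair`, and `build_subword_units`. It should not be read as the authors' implementation.

```python
import math
from collections import Counter


def description_length(corpus, n_base):
    """Assumed two-part code: bits to spell every unit plus bits to encode the corpus."""
    counts = Counter(u for word in corpus for u in word)
    total = sum(counts.values())
    # Data cost: negative log-likelihood of the corpus under a unigram model.
    data_bits = -sum(c * math.log2(c / total) for c in counts.values())
    # Model cost: each unit is spelled out in base symbols at a fixed rate.
    bits_per_symbol = math.log2(max(n_base, 2))
    model_bits = sum(len(u) * bits_per_symbol for u in counts)
    return data_bits + model_bits


def best_pair_by_mi(corpus, min_count, rejected):
    """Return the adjacent unit pair with the highest pointwise mutual information."""
    unit_counts = Counter(u for word in corpus for u in word)
    pair_counts = Counter(p for word in corpus for p in zip(word, word[1:]))
    n_units = sum(unit_counts.values())
    n_pairs = sum(pair_counts.values())
    best, best_mi = None, -math.inf
    for (a, b), c in pair_counts.items():
        if c < min_count or (a, b) in rejected:
            continue
        p_ab = c / n_pairs
        p_a, p_b = unit_counts[a] / n_units, unit_counts[b] / n_units
        mi = math.log2(p_ab / (p_a * p_b))
        if mi > best_mi:
            best, best_mi = (a, b), mi
    return best


def merge_pair(word, pair):
    """Replace each left-to-right occurrence of `pair` in `word` by the joined unit."""
    out, i = [], 0
    while i < len(word):
        if i + 1 < len(word) and (word[i], word[i + 1]) == pair:
            out.append(word[i] + word[i + 1])  # tuple concatenation
            i += 2
        else:
            out.append(word[i])
            i += 1
    return out


def build_subword_units(words, max_iters=100, min_count=2):
    """Greedy loop: the MI score proposes a merge, the MDL test accepts or rejects it."""
    # Units are tuples of base symbols; every word starts as single characters.
    corpus = [[(ch,) for ch in w] for w in words]
    n_base = len({u for word in corpus for u in word})
    dl = description_length(corpus, n_base)
    rejected = set()
    for _ in range(max_iters):
        pair = best_pair_by_mi(corpus, min_count, rejected)
        if pair is None:
            break
        candidate = [merge_pair(word, pair) for word in corpus]
        new_dl = description_length(candidate, n_base)
        if new_dl < dl:  # keep the merge only if the total code length shrinks
            corpus, dl = candidate, new_dl
        else:
            rejected.add(pair)
    # Return each word as a list of sub-word strings.
    return [["".join(u) for u in word] for word in corpus]
```

Calling `build_subword_units(["lower", "lowest", "newer", "newest"] * 3)` returns one sub-word segmentation per input word. In a retrieval setting, an OOV query would be segmented with the learned units and matched against a sub-word index; that step is outside this sketch.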

CLC Number: