Computer Engineering, 2007, Vol. 33, Issue (08): 196-198. doi: 10.3969/j.issn.1000-3428.2007.08.069

• Artificial Intelligence and Recognition Technology •

Research on a Method for Recognizing Unknown Chinese Words
Based on Statistics and Rules

ZHOU Lei¹, ZHU Qiaoming²

  1. Department of Computer Science and Engineering, Changshu Institute of Technology, Changshu 215500; 2. School of Computer Science and Technology, Suzhou University, Suzhou 215006
  • Online: 2007-04-20 Published: 2007-04-20

Abstract: This paper presents a method for recognizing unknown (out-of-vocabulary) Chinese words that combines statistics and rules. The method has two steps. (1) The text is segmented, and adjacent single characters left in the segmentation result are joined into short strings (fragments). Each fragment is fully segmented into candidate strings, which form a temporary dictionary, and each string in the temporary dictionary is assigned a weight based on rules and frequency information. A greedy algorithm then finds the longest path through each fragment, and every string on that path other than a single character is extracted as an unknown word. (2) Building on the first step, a bigram model is constructed and combined with mutual information to extract unknown words (or phrases) formed from several adjacent words. In open tests, the method achieves a precision of 81.25% and a recall of 82.38%.
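
The first step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting rules, so the `weight` function here (frequency times length, zero for single characters and for strings seen fewer than `min_count` times) is a hypothetical stand-in, as are `max_len`, `min_count`, and all function names. Reading "longest path" as the highest-weight choice at each position is also an assumption.

```python
from collections import Counter


def all_substrings(fragment, max_len=4):
    """Full segmentation: every substring of the fragment up to max_len characters."""
    return [fragment[i:j]
            for i in range(len(fragment))
            for j in range(i + 1, min(i + max_len, len(fragment)) + 1)]


def build_temp_dictionary(fragments, max_len=4):
    """Temporary dictionary: how often each candidate string occurs in the fragments."""
    counts = Counter()
    for fragment in fragments:
        counts.update(all_substrings(fragment, max_len))
    return counts


def weight(string, counts, min_count=2):
    """Hypothetical weight (assumption): frequency * length.

    Single characters and strings seen fewer than min_count times get 0.
    The paper combines rules and frequency; those rules are not reproduced here.
    """
    if len(string) == 1 or counts[string] < min_count:
        return 0
    return counts[string] * len(string)


def greedy_path(fragment, counts, max_len=4):
    """Greedy walk: at each position take the highest-weight candidate, then skip past it."""
    path, i = [], 0
    while i < len(fragment):
        # Candidates are ordered shortest first, so a tie at weight 0
        # falls back to the single character at position i.
        candidates = [fragment[i:j]
                      for j in range(i + 1, min(i + max_len, len(fragment)) + 1)]
        best = max(candidates, key=lambda s: weight(s, counts))
        path.append(best)
        i += len(best)
    return path


def extract_unknown_words(fragments, max_len=4):
    """Every multi-character string on a fragment's greedy path is a candidate unknown word."""
    counts = build_temp_dictionary(fragments, max_len)
    found = set()
    for fragment in fragments:
        found.update(s for s in greedy_path(fragment, counts, max_len) if len(s) > 1)
    return found


print(extract_unknown_words(["奥巴马", "奥巴马说", "见奥巴马"]))  # {'奥巴马'}
```

In this toy input the string 奥巴马 recurs in all three fragments, so it outweighs the one-off longer candidates and is the only string extracted.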

Key words: unknown Chinese word recognition, greedy algorithm, bigram model, mutual information
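
The second step, combining a bigram model with mutual information, might look roughly like the Python sketch below. It scores each adjacent word pair by pointwise mutual information, log2(P(x, y) / (P(x) P(y))), estimated from bigram and unigram counts, and proposes merging pairs that are both frequent and strongly associated. The thresholds `min_count` and `min_pmi` and the function names are assumptions for illustration; the abstract does not state the values or the exact formulation the authors used.

```python
import math
from collections import Counter


def pair_pmi(sentences):
    """Return (bigram counts, PMI per adjacent word pair) for segmented sentences."""
    unigrams, bigrams = Counter(), Counter()
    for words in sentences:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[x] / n_uni
        p_y = unigrams[y] / n_uni
        pmi[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return bigrams, pmi


def merge_candidates(sentences, min_count=2, min_pmi=1.0):
    """Propose merged words for pairs that pass both (hypothetical) thresholds."""
    bigrams, pmi = pair_pmi(sentences)
    return sorted(x + y
                  for (x, y), score in pmi.items()
                  if bigrams[(x, y)] >= min_count and score >= min_pmi)


sentences = [
    ["我", "在", "苏州", "大学"],
    ["苏州", "大学", "很", "大"],
    ["我", "很", "好"],
]
print(merge_candidates(sentences))  # ['苏州大学']
```

The frequency threshold matters as much as the PMI threshold: on small samples, pairs seen only once can receive high PMI scores purely by chance.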
