Abstract: This paper presents a method for recognizing unknown (out-of-vocabulary) Chinese words that combines statistics and rules. The method has two steps. (1) The text is segmented, and adjacent single characters left in the segmentation result are joined into short strings (fragments). Each fragment is fully segmented into candidate strings, which form a temporary dictionary, and every string in that dictionary is assigned a weight based on rules and frequency information. A greedy algorithm then finds the longest path through each fragment; every string on that path other than a single character is extracted as an unknown word. (2) Building on the first step, a bigram model is constructed and combined with mutual information to extract unknown words (or phrases) formed from several adjacent words. Experiments show that the method achieves a precision of 81.25% and a recall of 82.38% in open tests.
Key words:
Unknown Chinese word recognition,
Greedy algorithm,
Bigram model,
Mutual information
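The second step, combining adjacent words by mutual information, can be illustrated with a minimal sketch. The abstract does not say how the bigram model and mutual information are combined, so everything here is an assumption made for illustration: the function name `candidate_compounds`, the use of pointwise mutual information over adjacent word pairs, and the thresholds `min_count` and `min_pmi`. It is not the authors' implementation.

```python
import math
from collections import Counter


def candidate_compounds(sentences, min_count=2, min_pmi=1.0):
    """Score adjacent word pairs by pointwise mutual information (PMI).

    `sentences` is a list of already segmented sentences (lists of words).
    Returns (joined_string, pmi) tuples sorted by descending PMI.
    The thresholds are hypothetical defaults, not values from the paper.
    """
    unigrams = Counter()
    bigrams = Counter()
    for words in sentences:
        unigrams.update(words)
        # Adjacent pairs only; pairs never cross a sentence boundary.
        bigrams.update(zip(words, words[1:]))

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())

    results = []
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue  # too rare to estimate reliably
        p_pair = count / n_bi
        p1 = unigrams[w1] / n_uni
        p2 = unigrams[w2] / n_uni
        pmi = math.log2(p_pair / (p1 * p2))
        if pmi >= min_pmi:
            results.append((w1 + w2, pmi))

    results.sort(key=lambda item: item[1], reverse=True)
    return results


if __name__ == "__main__":
    # Toy segmented corpus, invented for this example.
    corpus = [
        ["苏州", "大学", "位于", "苏州"],
        ["苏州", "大学", "很", "大"],
        ["他", "在", "大学"],
    ]
    for word, score in candidate_compounds(corpus):
        print(word, round(score, 3))
```

The frequency threshold is there because PMI tends to overrate pairs that occur only once. A real system would tune both thresholds on held-out data and would probably combine the score with the bigram probabilities.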
ZHOU Lei; ZHU Qiaoming. Research on Recognition Method of Unknown Chinese Words Based on Statistic and Regulation[J]. Computer Engineering, 2007, 33(08): 196-198.