Abstract:
This paper proposes a model for Chinese word segmentation and part-of-speech tagging. In the word segmentation stage, the top N segmentation results are retained as candidates. The final result is then selected from these candidates after unknown word recognition and part-of-speech tagging. A Chinese lexical analyzer is implemented based on this model and tested with training sets of different sizes. The analyzer achieves 98.34% accuracy on word segmentation and 96.07% on part-of-speech tagging, which demonstrates the effectiveness of the method.
Key words:
word segmentation,
part-of-speech tagging,
shortest path
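The candidate-generation step described in the abstract (keep the top N segmentations, then choose among them later) can be illustrated with a small N-shortest-path search over a word lattice. This is only a minimal sketch under stated assumptions, not the authors' implementation: the lexicon, its costs, the unknown-character penalty, and the function name `n_best_segmentations` are all hypothetical, and the later stages (unknown word recognition and part-of-speech tagging) are not shown.

```python
import heapq

# Hypothetical toy lexicon: word -> cost (think of it as a negative log probability).
LEXICON = {
    "研究": 2.0, "研究生": 3.0, "生命": 2.5, "命": 4.0,
    "生": 4.0, "起源": 2.5, "的": 1.0,
}
UNKNOWN_CHAR_COST = 8.0  # assumed penalty for a single character not in the lexicon
MAX_WORD_LEN = max(len(w) for w in LEXICON)


def n_best_segmentations(text, n):
    """Return up to n (cost, words) pairs for text, cheapest first."""
    # best[i] holds up to n cheapest (cost, words) ways to segment text[:i].
    best = [[] for _ in range(len(text) + 1)]
    best[0] = [(0.0, [])]
    for end in range(1, len(text) + 1):
        candidates = []
        for start in range(max(0, end - MAX_WORD_LEN), end):
            word = text[start:end]
            if word in LEXICON:
                cost = LEXICON[word]
            elif len(word) == 1:
                cost = UNKNOWN_CHAR_COST  # fallback edge keeps the lattice connected
            else:
                continue
            # Extend every retained path that ends at `start` with this word.
            for prev_cost, prev_words in best[start]:
                candidates.append((prev_cost + cost, prev_words + [word]))
        # Keep only the n cheapest paths reaching position `end`.
        best[end] = heapq.nsmallest(n, candidates, key=lambda c: c[0])
    return best[len(text)]
```

With this toy lexicon, `n_best_segmentations("研究生命的起源", 3)` ranks 研究 / 生命 / 的 / 起源 first and keeps 研究生 / 命 / 的 / 起源 as the second candidate, so a later tagging stage would still be able to pick between them.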
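Abstract (translated from Chinese): A model for Chinese word segmentation and part-of-speech tagging is constructed. In the segmentation stage, the N best results are kept as a candidate set; after unknown word recognition and part-of-speech tagging, the best candidate in the set is selected as the final result. Based on this model, a Chinese lexical analyzer that performs automatic word segmentation and automatic part-of-speech tagging is implemented. Tests with training sets of different sizes show that the analyzer's word segmentation accuracy and part-of-speech tagging accuracy reach 98.34% and 96.07% respectively, demonstrating the effectiveness of the method.
Key words (translated from Chinese):
word segmentation,
part-of-speech tagging,
shortest path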
LIU Yao-feng; WANG Zhi-liang; WANG Chuan-jing. Model of Chinese Words Segmentation and Part-of-Word Tagging[J]. Computer Engineering, 2010, 36(4): 17-19.
刘遥峰;王志良;王传经. 中文分词和词性标注模型[J]. 计算机工程, 2010, 36(4): 17-19.