Computer Engineering ›› 2010, Vol. 36 ›› Issue (06): 24-26. doi: 10.3969/j.issn.1000-3428.2010.06.008

• Doctoral Dissertation •

Uyghur Sentence Boundary Identification Model Based on Maximum Entropy

Aishan WUMAIER (艾山•吾买尔), Tuergen YIBULAYIN (吐尔根•依步拉音)

  1. (College of Information Science & Engineering, Xinjiang University, Urumqi 830046)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-03-20 Published: 2010-03-20

Abstract: A Maximum Entropy (ME) model is used to identify sentence boundaries in Uyghur text. Training requires no hand-crafted rules, part-of-speech tags, or morphological analysis; it uses only features that are easy to obtain, such as word length and syllables. To determine the best feature template, different templates are assembled from the feature space and tested. Experimental results show that the best feature template is robust and achieves a recall of up to 97.72%.

Key words: Uyghur, sentence boundary identification, feature selection, Maximum Entropy (ME)
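The approach summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the feature template (`prev_len`, `prev_syl`, `next_len`), the vowel-counting syllable proxy for Latin-script Uyghur, the training settings, and the toy tokens are all assumptions made for illustration. The model itself is a standard conditional maximum entropy classifier trained by gradient ascent with L2 regularization, and the last function computes recall, the metric reported above.

```python
import math
from collections import defaultdict

LABELS = ("boundary", "not_boundary")
VOWELS = set("aeéioöuü")  # assumption: vowel letters of Latin-script Uyghur


def count_syllables(word):
    # Rough proxy (assumption): one syllable per vowel letter.
    return sum(ch in VOWELS for ch in word.lower())


def extract_features(prev_word, next_word):
    # Hypothetical feature template built from word length and syllable count
    # of the words around a candidate boundary mark.
    return [
        "bias",
        f"prev_len={len(prev_word)}",
        f"prev_syl={count_syllables(prev_word)}",
        f"next_len={len(next_word)}",
    ]


def predict_proba(weights, feats):
    # P(y | x) = exp(sum of weights of active (feature, label) pairs) / Z
    scores = {y: sum(weights.get((f, y), 0.0) for f in feats) for y in LABELS}
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {y: math.exp(s - top) for y, s in scores.items()}
    z = sum(exps.values())
    return {y: e / z for y, e in exps.items()}


def train(samples, epochs=200, lr=0.5, l2=0.01):
    # Batch gradient ascent on the L2-regularized conditional log-likelihood.
    weights = defaultdict(float)
    for _ in range(epochs):
        grad = defaultdict(float)
        for feats, gold in samples:
            probs = predict_proba(weights, feats)
            for f in feats:
                for y in LABELS:
                    # observed count minus expected count of the feature
                    grad[(f, y)] += (1.0 if y == gold else 0.0) - probs[y]
        for key, g in grad.items():
            weights[key] += lr * (g / len(samples) - l2 * weights[key])
    return weights


def classify(weights, prev_word, next_word):
    probs = predict_proba(weights, extract_features(prev_word, next_word))
    return max(probs, key=probs.get)


def recall(gold, predicted, positive="boundary"):
    tp = sum(g == positive and p == positive for g, p in zip(gold, predicted))
    fn = sum(g == positive and p != positive for g, p in zip(gold, predicted))
    return tp / (tp + fn) if tp + fn else 0.0


# Toy candidates (word before the mark, word after it, label); illustrative only.
TOY_DATA = [
    ("keldi", "Men", "boundary"),
    ("baridu", "U", "boundary"),
    ("oquydu", "Biz", "boundary"),
    ("dr", "Exmet", "not_boundary"),
    ("prof", "Alim", "not_boundary"),
    ("m", "Qasim", "not_boundary"),
]

samples = [(extract_features(p, n), y) for p, n, y in TOY_DATA]
weights = train(samples)
predicted = [classify(weights, p, n) for p, n, _ in TOY_DATA]

print(classify(weights, "yazdi", "Men"))
print("training recall:", recall([y for _, _, y in TOY_DATA], predicted))
```

Swapping the contents of `extract_features` is how different feature templates could be compared in this sketch, mirroring the template search described in the abstract; the recall printed here is on toy training data and says nothing about the reported 97.72%.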
