
Computer Engineering ›› 2007, Vol. 33 ›› Issue (03): 192-193. doi: 10.3969/j.issn.1000-3428.2007.03.069

• Artificial Intelligence and Recognition Technology •

Archaic Chinese Punctuating Sentences Based on Context N-gram Model

CHEN Tianying1, CHEN Rong1, PAN Lulu1, LI Hongjun1,2, YU Zhonghua1

  1. Dept. of Computer Science, Sichuan University, Chengdu 610064; 2. Dept. of Computer Science, Southwest University of Science and Technology, Mianyang 621002
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-02-05 Published: 2007-02-05

Abstract: This paper proposes an algorithm for punctuating sentences in archaic Chinese text based on a context n-gram model. When training data are sparse, the algorithm collects context information to predict punctuation positions with comparatively good accuracy, so it copes better with a small training corpus and reduces the effect of data sparsity on segmentation accuracy. Sentence segmentation experiments on the Analects of Confucius (Lunyu) achieved a recall of 81% and a precision of 52%.

Key words: n-gram model, data sparsity, smoothing techniques, context-based n-gram model
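
The abstract does not give the model's formulas, so the following is only a minimal illustrative sketch of the general idea, not the authors' method. It assumes a character-level model that scores each gap between two characters from its left and right context, and it uses add-alpha smoothing as a stand-in for whatever smoothing technique the paper applies. All function names, the `alpha` parameter, and the `threshold` value are hypothetical. The precision and recall definitions are the standard ones.

```python
from collections import Counter


def train(sentences):
    """Count how often each character ends or starts a sentence."""
    char_count = Counter()   # total occurrences of each character
    end_count = Counter()    # times a character is the last one before a break
    start_count = Counter()  # times a character is the first one after a break
    for s in sentences:
        if not s:
            continue
        char_count.update(s)
        end_count[s[-1]] += 1
        start_count[s[0]] += 1
    return char_count, end_count, start_count


def break_score(left, right, model, alpha=1.0):
    """Smoothed score that a break falls between `left` and `right`.

    Assumption: the left and right contexts are combined by a simple product.
    """
    char_count, end_count, start_count = model
    # Add-alpha smoothing keeps unseen characters from giving zero or
    # undefined estimates, which matters when the training corpus is small.
    p_after = (end_count[left] + alpha) / (char_count[left] + 2 * alpha)
    p_before = (start_count[right] + alpha) / (char_count[right] + 2 * alpha)
    return p_after * p_before


def segment(text, model, threshold=0.3):
    """Return the gap indices i where a break is predicted before text[i]."""
    return {
        i
        for i in range(1, len(text))
        if break_score(text[i - 1], text[i], model) >= threshold
    }


def precision_recall(predicted, gold):
    """Precision = hits / predicted breaks; recall = hits / true breaks."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

Read against these definitions, the reported 81% recall and 52% precision mean that the algorithm finds about four out of five true sentence boundaries, while roughly half of the boundaries it predicts do not match a boundary in the reference text.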