Abstract: This paper proposes an algorithm for segmenting sentences in archaic Chinese text, based on a context n-gram model. By collecting information from the preceding and following context, the algorithm predicts segmentation positions fairly accurately even when data are sparse. It therefore copes better with small training corpora and reduces the effect of data sparsity on segmentation accuracy. In a sentence-segmentation experiment on the Analects of Confucius (Lunyu), the algorithm achieved a recall of 81% and a precision of 52%.
Keywords:
n-gram model,
data sparsity,
smoothing techniques,
context-based n-gram model
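The abstract does not spell out the model, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It assumes a character-level model that estimates the probability of a break between two characters from the (left, right) pair, and falls back to the left-only and right-only contexts when the pair was never seen in training. The function names (`train`, `break_prob`, `segment`, `precision_recall`), the `/` separator, the averaging back-off, and the 0.5 threshold are all hypothetical choices made for illustration.

```python
from collections import Counter

def train(segmented_lines, sep="/"):
    """Count breaks per context. Assumed input: lines with `sep` marking breaks."""
    keys = ("pair", "left", "right")
    brk = {k: Counter() for k in keys}   # times a break occurred in this context
    tot = {k: Counter() for k in keys}   # times this context occurred at all
    for line in segmented_lines:
        chars, breaks = [], set()
        for ch in line:
            if ch == sep:
                breaks.add(len(chars))   # gap index: break before chars[len(chars)]
            else:
                chars.append(ch)
        for i in range(1, len(chars)):
            left, right = chars[i - 1], chars[i]
            for key, ctx in (("pair", (left, right)), ("left", left), ("right", right)):
                tot[key][ctx] += 1
                if i in breaks:
                    brk[key][ctx] += 1
    return brk, tot

def break_prob(model, left, right):
    """Estimate P(break between left and right), backing off when the pair is unseen."""
    brk, tot = model
    pair = (left, right)
    if tot["pair"][pair] > 0:
        return brk["pair"][pair] / tot["pair"][pair]
    # Hypothetical back-off: average whichever one-sided estimates exist.
    estimates = []
    if tot["left"][left] > 0:
        estimates.append(brk["left"][left] / tot["left"][left])
    if tot["right"][right] > 0:
        estimates.append(brk["right"][right] / tot["right"][right])
    return sum(estimates) / len(estimates) if estimates else 0.0

def segment(model, text, threshold=0.5):
    """Return the set of gap indices i (break before text[i]) predicted as breaks."""
    return {i for i in range(1, len(text))
            if break_prob(model, text[i - 1], text[i]) >= threshold}

def precision_recall(predicted, gold):
    """Precision and recall of predicted break positions against gold positions."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall

# Toy usage on two hand-segmented lines from the Analects.
model = train(["学而时习之/不亦说乎", "有朋自远方来/不亦乐乎"])
predicted = segment(model, "人不知而不愠不亦君子乎")
print(sorted(predicted), precision_recall(predicted, gold={6}))
```

Recall here is the share of true break positions that were predicted, and precision is the share of predicted breaks that are correct, which is how the 81% and 52% figures in the abstract are to be read. A real system would need a much larger training set and a principled smoothing scheme in place of the simple averaging used in this sketch.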
陈天莹;陈 蓉;潘璐璐;李红军;于中华. 基于前后文n-gram模型的古汉语句子切分[J]. 计算机工程, 2007, 33(03): 192-193.
CHEN Tianying; CHEN Rong; PAN Lulu; LI Hongjun; YU Zhonghua. Archaic Chinese Punctuating Sentences Based on Context N-gram Model[J]. Computer Engineering, 2007, 33(03): 192-193.