
Computer Engineering ›› 2007, Vol. 33 ›› Issue (10): 16-18. doi: 10.3969/j.issn.1000-3428.2007.10.006

• Doctoral Papers •

Encyclopedia Text Topic Segmentation Based on CRF

XU Yong (许 勇)1, SONG Rou (宋 柔)2

  1. Institute of Computer Science, Beijing University of Technology, Beijing 100022
  2. Dept. of Computer Science, Beijing Language and Culture University, Beijing 100083
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-05-20 Published: 2007-05-20

Abstract: The conditional random field (CRF) is a relatively recent probabilistic model for segmenting and labeling sequence data. It has attracted wide attention and has been applied successfully to natural language processing tasks such as information extraction. This paper introduces the CRF model and applies it to topic segmentation of encyclopedia text. By using the CRF feature representation mechanism to encode long-distance constraints over the sequence of text units, the model achieves better results than the traditional hidden Markov model (HMM) on this task.

Key words: Topic segmentation, Conditional random fields (CRF), Hidden Markov model (HMM)
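
As background for the comparison drawn in the abstract, the standard linear-chain CRF can be written as follows. This is the conventional textbook form, not a formula taken from the paper; the symbols ($x$ for the observed sequence of text units, $y$ for the label sequence, $f_k$ for feature functions, $\lambda_k$ for their weights) are assumed notation.

```latex
% Standard linear-chain CRF (conventional notation, assumed here):
%   x   : observed sequence of T text units
%   y   : label sequence, e.g. segment-boundary labels
%   f_k : feature function, \lambda_k : its learned weight
p(y \mid x) = \frac{1}{Z(x)}
  \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, x, t) \Big),
\qquad
Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, x, t) \Big)
```

Each feature function $f_k$ may inspect the entire observation sequence $x$, so a feature at position $t$ can depend on text units far from $t$. A first-order HMM, by contrast, models each observation as generated from the current state alone, which is one way to read the abstract's point about long-distance constraints.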
