Computer Engineering ›› 2008, Vol. 34 ›› Issue (22): 43-45. doi: 10.3969/j.issn.1000-3428.2008.22.015

• Software Technology and Database •

Text Segmentation Algorithm Oriented to Small General-text

CHEN Yuan, CHEN Rong, HU Jun-feng, LIN Lin, ZHANG Jing-bo, YU Zhong-hua   

  1. (School of Computer Science, Sichuan University, Chengdu 610064)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-11-20 Published: 2008-11-20

面向概括性小文本的文本分割算法

陈 源,陈 蓉,胡俊锋,林 霖,张靖波,于中华   

  1. (四川大学计算机学院,成都 610064)

Abstract: Text segmentation is an important field in natural language processing. However, existing models cannot effectively segment small general-text. To address this, this paper proposes an algorithm based on the Hidden Markov Model (HMM). The algorithm segments a small general-text with a single topic into the different aspects it discusses, using the length distribution of each structural block together with term information. Two methods are designed for computing the symbol emission probabilities of the HMM: one based on sentence groups and the other based on segmentation points. Experiments on Medline abstracts show that the proposed algorithm performs markedly better than the TextTiling algorithm.

Key words: text segmentation, small general-text, Hidden Markov Model (HMM), boundary recognition, similarity metric
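
The HMM formulation summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify either of the two emission-probability methods, so this sketch assumes a smoothed unigram term model per structural block, a left-to-right state order that always starts in the first block, and self-transition probabilities as a crude stand-in for the block length distribution. The names `train`, `segment`, `STATES`, and `TOY_DOCS`, and the toy sentences, are hypothetical.

```python
import math
from collections import Counter

def train(docs, states, alpha=1.0):
    """Estimate a left-to-right HMM from labeled documents.

    docs: list of documents, each a list of (block_label, tokens) sentences.
    Assumption: emissions are add-alpha smoothed unigram term probabilities.
    """
    counts = {s: Counter() for s in states}
    stay, move, vocab = Counter(), Counter(), set()
    for doc in docs:
        for i, (label, tokens) in enumerate(doc):
            counts[label].update(tokens)
            vocab.update(tokens)
            if i + 1 < len(doc):
                # Count whether the next sentence stays in the same block.
                if doc[i + 1][0] == label:
                    stay[label] += 1
                else:
                    move[label] += 1
    model = {
        "states": list(states),
        "alpha": alpha,
        "v": len(vocab) + 1,  # one extra slot for unseen terms
        "counts": counts,
        "totals": {s: sum(counts[s].values()) for s in states},
        "log_stay": {},
        "log_move": {},
    }
    for s in states:
        denom = stay[s] + move[s] + 2 * alpha
        model["log_stay"][s] = math.log((stay[s] + alpha) / denom)
        model["log_move"][s] = math.log((move[s] + alpha) / denom)
    return model

def log_emit(model, state, tokens):
    """Log-probability of a sentence's terms under one block's unigram model."""
    denom = model["totals"][state] + model["alpha"] * model["v"]
    return sum(
        math.log((model["counts"][state][w] + model["alpha"]) / denom)
        for w in tokens
    )

def segment(sentences, model):
    """Viterbi decoding: assign each sentence to a structural block."""
    if not sentences:
        return []
    states = model["states"]
    n, k = len(sentences), len(states)
    neg_inf = float("-inf")
    score = [[neg_inf] * k for _ in range(n)]
    back = [[0] * k for _ in range(n)]
    # Assumption: every text starts in the first block.
    score[0][0] = log_emit(model, states[0], sentences[0])
    for t in range(1, n):
        for j in range(k):
            # Either stay in block j or advance from block j-1.
            best = score[t - 1][j] + model["log_stay"][states[j]]
            arg = j
            if j > 0:
                cand = score[t - 1][j - 1] + model["log_move"][states[j - 1]]
                if cand > best:
                    best, arg = cand, j - 1
            score[t][j] = best + log_emit(model, states[j], sentences[t])
            back[t][j] = arg
    j = max(range(k), key=lambda idx: score[n - 1][idx])
    path = [j]
    for t in range(n - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    return [states[idx] for idx in reversed(path)]

# Hypothetical toy data in the style of a structured abstract.
STATES = ["BACKGROUND", "METHODS", "RESULTS"]
TOY_DOCS = [[
    ("BACKGROUND", ["disease", "is", "common"]),
    ("BACKGROUND", ["little", "is", "known"]),
    ("METHODS", ["we", "enrolled", "patients"]),
    ("METHODS", ["we", "measured", "levels"]),
    ("RESULTS", ["levels", "were", "higher"]),
    ("RESULTS", ["risk", "was", "reduced"]),
]]

if __name__ == "__main__":
    hmm = train(TOY_DOCS, STATES)
    new_text = [["disease", "is", "known"],
                ["we", "enrolled", "patients"],
                ["risk", "was", "higher"]]
    print(segment(new_text, hmm))
```

Segment boundaries fall wherever consecutive labels differ. A fuller treatment would replace the self-transition probabilities with an explicit length distribution per block, which is closer to what the abstract describes.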

摘要: 文本分割是自然语言文本处理的一项重要研究内容。该文针对现有模型无法有效分割概括性小文本的不足,提出基于隐马尔可夫模型的统计算法。该算法利用小文本中各结构块的长度及词汇信息,对概括性小文本进行同一主题不同论述侧面的分割。对发射概率设计了基于句群和基于分割点2种不同的计算方法。以Medline摘要为样本进行的实验表明,该算法对概括性小文本分割是有效的,明显好于经典的TextTiling算法。

关键词: 文本分割, 概括性小文本, 隐马尔可夫模型, 边界识别, 相似性度量
