Computer Engineering, 2009, Vol. 35, Issue (18): 25-27. doi: 10.3969/j.issn.1000-3428.2009.18.009

• Doctoral Dissertation •

Web Information Extraction Based on Hidden Markov Model

LIU Ya-qing, CHEN Rong   

  1. (Institute of Information Science and Technology, Dalian Maritime University, Dalian 116026)
  • Online: 2009-09-20 Published: 2009-09-20

Abstract: To address the "missing item" and "unordered item" problems in Web information extraction, this paper proposes a Web information extraction method based on the Hidden Markov Model (HMM). A Web document is parsed into an extended DOM tree. Each information item to be extracted is mapped to a state, and the path of that item in the extended DOM tree is mapped to a vocabulary symbol. The HMM is then constructed with an induction algorithm. Experimental results show that the method achieves better extraction performance.

Keywords: information extraction, Hidden Markov Model (HMM), extended DOM tree
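
The mapping described in the abstract (information items as states, DOM paths as vocabulary symbols) can be illustrated with a small sketch. It rests on several assumptions that are not taken from the paper: plain tag paths such as `html/body/div/h2` stand in for paths in the extended DOM tree, simple counting with add-one smoothing stands in for the induction algorithm, and the item names (`title`, `price`, `description`) and the helpers `train_hmm` and `viterbi` are hypothetical.

```python
import math
from collections import Counter, defaultdict

UNK = "<unk>"  # placeholder symbol for DOM paths never seen in training


def train_hmm(sequences, alpha=1.0):
    """Estimate log-probabilities by counting over labeled (path, item) sequences.

    Assumption: add-alpha smoothed counting replaces the paper's induction
    algorithm, which the abstract does not describe.
    """
    states = sorted({item for seq in sequences for _, item in seq})
    vocab = sorted({path for seq in sequences for path, _ in seq}) + [UNK]
    start = Counter()
    trans = defaultdict(Counter)
    emit = defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for (_, a), (_, b) in zip(seq, seq[1:]):
            trans[a][b] += 1
        for path, item in seq:
            emit[item][path] += 1

    def log_probs(counts, keys):
        total = sum(counts[k] for k in keys) + alpha * len(keys)
        return {k: math.log((counts[k] + alpha) / total) for k in keys}

    return {
        "states": states,
        "start": log_probs(start, states),
        "trans": {s: log_probs(trans[s], states) for s in states},
        "emit": {s: log_probs(emit[s], vocab) for s in states},
    }


def viterbi(model, paths):
    """Return the most likely item label for each observed DOM path."""
    if not paths:
        return []
    states = model["states"]

    def e(state, path):
        row = model["emit"][state]
        return row.get(path, row[UNK])

    score = {s: model["start"][s] + e(s, paths[0]) for s in states}
    back = []
    for path in paths[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda p: prev[p] + model["trans"][p][s])
            score[s] = prev[best] + model["trans"][best][s] + e(s, path)
            ptr[s] = best
        back.append(ptr)
    labels = [max(states, key=score.get)]
    for ptr in reversed(back):
        labels.append(ptr[labels[-1]])
    return labels[::-1]


# Hypothetical training records: one complete, one with a missing item,
# one with the items in a different order.
training = [
    [("html/body/div/h2", "title"), ("html/body/div/span", "price"),
     ("html/body/div/p", "description")],
    [("html/body/div/h2", "title"), ("html/body/div/p", "description")],
    [("html/body/div/span", "price"), ("html/body/div/h2", "title"),
     ("html/body/div/p", "description")],
]
model = train_hmm(training)
missing = viterbi(model, ["html/body/div/h2", "html/body/div/p"])
reordered = viterbi(model, ["html/body/div/span", "html/body/div/h2",
                            "html/body/div/p"])
```

Because the transition probabilities are learned from records that skip or reorder items, decoding does not depend on a fixed item order, which is the property the abstract highlights. Here `missing` labels a record without a price and `reordered` labels a record whose price precedes its title.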

CLC Number: