Abstract: To address the problems of missing items and unordered items in Web information extraction, this paper proposes a Web information extraction method based on the Hidden Markov Model (HMM). A Web document is parsed into an extended DOM tree. Each information item to be extracted is mapped to a state, and the path of that item in the extended DOM tree is mapped to a vocabulary symbol. An HMM is then constructed using an induction algorithm. Experimental results show that the method achieves better extraction performance.
Key words:
information extraction,
Hidden Markov Model (HMM),
extended DOM tree
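The mapping described in the abstract (information items as states, DOM paths as vocabulary symbols) can be illustrated with a minimal sketch. The abstract does not specify the induction algorithm or the structure of the extended DOM tree, so everything below is an assumption made for illustration: the hypothetical path strings and item names (`title`, `author`, `price`), parameter estimation by simple counting with add-one smoothing, and Viterbi decoding to label a new page. The training pages deliberately include one with a missing item and one with reordered items.

```python
import math
from collections import Counter, defaultdict


def train_hmm(sequences, smoothing=1.0):
    """Estimate log-probabilities by counting over labeled (path, item) sequences.

    Assumption: plain counting with additive smoothing stands in for the
    paper's induction algorithm, which the abstract does not describe.
    """
    states = sorted({item for seq in sequences for _, item in seq})
    vocab = sorted({path for seq in sequences for path, _ in seq}) + ["<unk>"]
    start, trans, emit = Counter(), defaultdict(Counter), defaultdict(Counter)
    for seq in sequences:
        start[seq[0][1]] += 1
        for i, (path, item) in enumerate(seq):
            emit[item][path] += 1
            if i > 0:
                trans[seq[i - 1][1]][item] += 1

    def log_dist(counts, support):
        total = sum(counts.values()) + smoothing * len(support)
        return {k: math.log((counts[k] + smoothing) / total) for k in support}

    return {
        "states": states,
        "start": log_dist(start, states),
        "trans": {s: log_dist(trans[s], states) for s in states},
        "emit": {s: log_dist(emit[s], vocab) for s in states},
    }


def viterbi(model, paths):
    """Return the most likely information-item label for each observed path."""
    states = model["states"]

    def emit_lp(state, path):
        table = model["emit"][state]
        return table.get(path, table["<unk>"])  # unseen paths fall back to <unk>

    score = {s: model["start"][s] + emit_lp(s, paths[0]) for s in states}
    backpointers = []
    for path in paths[1:]:
        prev, score, ptr = score, {}, {}
        for s in states:
            best = max(states, key=lambda r: prev[r] + model["trans"][r][s])
            score[s] = prev[best] + model["trans"][best][s] + emit_lp(s, path)
            ptr[s] = best
        backpointers.append(ptr)
    labels = [max(states, key=lambda s: score[s])]
    for ptr in reversed(backpointers):
        labels.append(ptr[labels[-1]])
    return labels[::-1]


# Hypothetical training pages: (path in the DOM tree, information item).
H1, SPAN1, SPAN2 = "html/body/div/h1", "html/body/div/span[1]", "html/body/div/span[2]"
TRAIN = [
    [(H1, "title"), (SPAN1, "author"), (SPAN2, "price")],
    [(H1, "title"), (SPAN2, "price")],                      # "author" is missing
    [(H1, "title"), (SPAN2, "price"), (SPAN1, "author")],   # items are reordered
]

model = train_hmm(TRAIN)
missing_item_labels = viterbi(model, [H1, SPAN2])
reordered_labels = viterbi(model, [H1, SPAN2, SPAN1])
print(missing_item_labels, reordered_labels)
```

Because the transition probabilities are learned from pages in which items are absent or reordered, the decoder in this sketch does not depend on a fixed item order, which is the property the abstract attributes to the HMM-based approach.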
LIU Ya-qing; CHEN Rong. Web Information Extraction Based on Hidden Markov Model[J]. Computer Engineering, 2009, 35(18): 25-27.