
Computer Engineering, 2009, Vol. 35, Issue (22): 59-61. doi: 10.3969/j.issn.1000-3428.2009.22.020

• Software Technology and Database •

Information Extraction Rule Generation Method Based on Data Area Discovery

QU Zhu-wei¹,², LI Min-qiang¹

  1. School of Management, Tianjin University, Tianjin 300072
  2. Information School, Zhejiang University of Finance & Economics, Hangzhou 310018
  • Online: 2009-11-20 Published: 2009-11-20

Abstract: This paper proposes a method that automatically detects the structural characteristics of data records in Web pages and generates Web information extraction rules. Working from the DOM tree of a page, the method automatically discovers and isolates the DOM subtree that corresponds to the Web data area, decomposes it into a set of data record subtrees, and generates extraction rules by synthesizing the structural characteristics of those subtrees. Experimental results show that the method achieves high extraction precision and recall.

Key words: information extraction, extraction rule generation, Web data area, tree matching
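
The pipeline summarized in the abstract (build a DOM tree, locate the data area, split it into record subtrees, derive a rule) can be illustrated with a minimal Python sketch. It is not the authors' algorithm, whose details are not given here. It assumes that the classic simple tree matching recurrence is an acceptable stand-in for the paper's tree matching, that a data area is a node whose adjacent children are all structurally similar, and that a tag path is an adequate rule format. All names (`Node`, `parse_dom`, `find_data_region`, `record_rule`) and the 0.8 threshold are hypothetical.

```python
from html.parser import HTMLParser

VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A DOM element; text nodes are ignored in this sketch."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []

    def size(self):
        return 1 + sum(child.size() for child in self.children)


class DomBuilder(HTMLParser):
    """Builds a simplified element tree under a synthetic '#root' node."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never have children
            self.current = node

    def handle_endtag(self, tag):
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:  # ignore stray closing tags
            self.current = node.parent


def parse_dom(html):
    builder = DomBuilder()
    builder.feed(html)
    return builder.root


def simple_tree_matching(a, b):
    """Size of the largest order-preserving top-down match of two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            table[i][j] = max(table[i - 1][j], table[i][j - 1],
                              table[i - 1][j - 1] + pair)
    return table[m][n] + 1


def similarity(a, b):
    """Normalized match score in [0, 1]."""
    return 2 * simple_tree_matching(a, b) / (a.size() + b.size())


def iter_nodes(node):
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def find_data_region(root, threshold=0.8, min_records=2):
    """Assumed heuristic: the largest node whose adjacent children all match."""
    best = None
    for node in iter_nodes(root):
        kids = node.children
        if len(kids) < min_records:
            continue
        if all(similarity(x, y) >= threshold for x, y in zip(kids, kids[1:])):
            if best is None or node.size() > best.size():
                best = node
    return best


def record_rule(region):
    """Hypothetical rule format: tag path from the page root to one record."""
    path = [region.children[0].tag]
    node = region
    while node.tag != "#root":
        path.append(node.tag)
        node = node.parent
    return "/".join(reversed(path))


SAMPLE_HTML = """
<html><body>
  <h1>Books</h1>
  <ul>
    <li><a href="/b1">Title one</a><span>10</span></li>
    <li><a href="/b2">Title two</a><span>12</span></li>
    <li><a href="/b3">Title three</a><span>9</span></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    region = find_data_region(parse_dom(SAMPLE_HTML))
    print(region.tag, len(region.children), record_rule(region))
```

On the sample page, the `ul` element is the only node whose children all match one another, so its three `li` children are treated as the data records. A practical rule generator would also need to handle optional fields and records that span several sibling nodes, which this sketch does not attempt.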
