Abstract: This paper proposes a method that automatically detects the structural characteristics of data records in Web pages and generates Web information extraction rules. Working on the page's DOM tree, the method automatically discovers and isolates the DOM subtree that corresponds to the Web data area, decomposes it into a set of data-record subtrees, and generates extraction rules by synthesizing the structural features of those subtrees. Experimental results show that the method achieves high extraction precision and recall.
Key words:
information extraction,
extraction rule generation,
Web data area,
tree matching
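The pipeline outlined in the abstract (locate the data area in the DOM tree, split it into data-record subtrees, then synthesize a rule from their shared structure) can be illustrated with a minimal sketch. The abstract does not say which tree-matching algorithm or rule format the authors use, so everything here is an assumption made for illustration: similarity is scored with the well-known Simple Tree Matching algorithm, the 0.8 threshold is arbitrary, a "rule" is simply the list of leaf tag paths common to all records, and the names `find_data_region` and `generate_rule` are hypothetical.

```python
import xml.etree.ElementTree as ET

def tree_size(node):
    """Number of nodes in the subtree rooted at node."""
    return 1 + sum(tree_size(child) for child in node)

def simple_tree_match(a, b):
    """Size of the maximum top-down, order-preserving matching of two trees."""
    if a.tag != b.tag:
        return 0
    kids_a, kids_b = list(a), list(b)
    # dp[i][j]: best match using the first i children of a and first j of b
    dp = [[0] * (len(kids_b) + 1) for _ in range(len(kids_a) + 1)]
    for i, ca in enumerate(kids_a, 1):
        for j, cb in enumerate(kids_b, 1):
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1],
                           dp[i - 1][j - 1] + simple_tree_match(ca, cb))
    return 1 + dp[-1][-1]

def similarity(a, b):
    """Normalized match score in [0, 1]."""
    return 2 * simple_tree_match(a, b) / (tree_size(a) + tree_size(b))

def find_data_region(root, threshold=0.8, min_records=2):
    """Pick the node with the most children whose adjacent children all look alike.

    The threshold and the 'largest group wins' policy are assumptions.
    """
    best = None
    for node in root.iter():
        kids = list(node)
        if len(kids) < min_records:
            continue
        if all(similarity(x, y) >= threshold for x, y in zip(kids, kids[1:])):
            if best is None or len(kids) > len(best[1]):
                best = (node, kids)
    return best

def leaf_paths(node, prefix=""):
    """Tag paths from the record root down to every leaf, in document order."""
    path = f"{prefix}/{node.tag}"
    kids = list(node)
    if not kids:
        return [path]
    paths = []
    for kid in kids:
        paths.extend(leaf_paths(kid, path))
    return paths

def generate_rule(records):
    """Keep the leaf paths present in every record (order of the first record)."""
    common = set(leaf_paths(records[0]))
    for record in records[1:]:
        common &= set(leaf_paths(record))
    return [p for p in dict.fromkeys(leaf_paths(records[0])) if p in common]

# Toy well-formed page; real HTML would need a tolerant parser.
PAGE = """<html><body>
<div id="nav"><a>Home</a></div>
<ul>
<li><b>Item A</b><span>10</span></li>
<li><b>Item B</b><span>20</span></li>
<li><b>Item C</b><span>30</span><i>new</i></li>
</ul>
</body></html>"""

root = ET.fromstring(PAGE)
region, records = find_data_region(root)
rule = generate_rule(records)
print(region.tag, len(records), rule)
```

On this toy page the three `li` subtrees are grouped as records under `ul`, and the optional `i` element in the third record is excluded from the rule because it does not occur in every record.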
QU Zhu-wei (曲著伟); LI Min-qiang (李敏强). Information Extraction Rule Generation Method Based on Data Area Discovery[J]. Computer Engineering (计算机工程), 2009, 35(22): 59-61.