Abstract: Data records in query-related Web pages are generated from highly similar HTML code, so the DOM subtrees corresponding to those records are also highly similar in structure. Exploiting this property, an interactive Web data extraction approach based on DOM subtree matching is presented. Experiments show that the approach achieves high recall and precision in data extraction.
Key words: Web data extraction; Top-down tree matching; DOM
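
The abstract names top-down tree matching over DOM subtrees but does not give the algorithm itself. As an illustration only, the following is a minimal sketch of the classic top-down "simple tree matching" dynamic program, which may differ from the authors' exact procedure. The `Node` class, the function names, and the normalized `similarity` score are hypothetical and introduced here for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag name plus ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Count the nodes in the subtree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def top_down_match(a: Node, b: Node) -> int:
    """Return the number of matched node pairs in a maximum top-down matching.

    Roots must carry the same tag; children are then aligned in order
    with a dynamic program similar to longest common subsequence.
    """
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            table[i][j] = max(
                table[i - 1][j],
                table[i][j - 1],
                table[i - 1][j - 1]
                + top_down_match(a.children[i - 1], b.children[j - 1]),
            )
    return table[m][n] + 1  # +1 for the matched pair of roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: matched pairs relative to the average subtree size."""
    return 2 * top_down_match(a, b) / (size(a) + size(b))
```

Under this sketch, two `tr` subtrees that differ by one extra `td` cell still score close to 1, while subtrees with different root tags score 0. An interactive extractor could take a record subtree selected by the user and keep sibling subtrees whose score exceeds a threshold; the threshold value would be a tuning choice and is not specified in the abstract.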
ZHANG Huiying, QU Zhuwei. Approach for Interactive Web Data Extraction Based on Sub-tree Matching[J]. Computer Engineering, 2006, 32(9): 78-80.