Abstract: This paper reviews information collection (crawling) algorithms based on Web site structure and Web page structure, and proposes a navigation tree algorithm based on restricted tree edit distance. The algorithm extracts the key HTML tags of each page to build a tag tree, an ordered labeled rooted tree that represents the page structure, and applies the restricted tree edit distance to this tree to judge whether a crawled page is relevant to the topic. Using the URL-based topology of the Web site, it then builds a navigation tree that restricts the crawl path of the collector. This improves the efficiency and accuracy of collecting target pages.
Key words: labeled tree, tree edit distance, navigation tree
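The tag-tree comparison summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the restriction placed on the edit distance, so the sketch substitutes Selkow's top-down tree edit distance, a simple restricted variant in which only whole subtrees are inserted or deleted. The set `KEY_TAGS`, the unit costs, and the names `build_tag_tree`, `tree_distance`, and `similarity` are all assumptions made for illustration.

```python
from html.parser import HTMLParser

# Hypothetical choice of "key" structural tags; the abstract does not list them.
KEY_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "li", "p", "a"}


class Node:
    """A node of the ordered labeled tag tree."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def size(self):
        return 1 + sum(child.size() for child in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from key tags only; all other tags and text are skipped."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")  # synthetic root so every page yields one tree
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in KEY_TAGS:
            node = Node(tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open node with this label, if there is one.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def tree_distance(a, b):
    """Selkow-style top-down distance with unit costs (an assumed cost model)."""
    cost = 0 if a.label == b.label else 1
    xs, ys = a.children, b.children
    # dp[i][j]: cost of turning the first i children of a into the first j of b.
    dp = [[0] * (len(ys) + 1) for _ in range(len(xs) + 1)]
    for i in range(1, len(xs) + 1):
        dp[i][0] = dp[i - 1][0] + xs[i - 1].size()  # delete a whole subtree
    for j in range(1, len(ys) + 1):
        dp[0][j] = dp[0][j - 1] + ys[j - 1].size()  # insert a whole subtree
    for i in range(1, len(xs) + 1):
        for j in range(1, len(ys) + 1):
            dp[i][j] = min(
                dp[i - 1][j] + xs[i - 1].size(),
                dp[i][j - 1] + ys[j - 1].size(),
                dp[i - 1][j - 1] + tree_distance(xs[i - 1], ys[j - 1]),
            )
    return cost + dp[-1][-1]


def similarity(a, b):
    """Normalizes the distance so that identical trees score 1.0."""
    return 1.0 - tree_distance(a, b) / (a.size() + b.size())
```

A crawler could compare each fetched page against a sample target page and keep only pages whose `similarity` exceeds a threshold; the threshold itself would have to be chosen empirically.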
CLC number:
姜 波;丁岳伟. 基于约束树编辑距离与导航树的信息采集[J]. 计算机工程, 2009, 35(14): 75-77.
JIANG Bo; DING Yue-wei. Information Extraction Based on Restricted Tree Edit Distance and Navigate Tree[J]. Computer Engineering, 2009, 35(14): 75-77.