Computer Engineering ›› 2009, Vol. 35 ›› Issue (14): 75-77. doi: 10.3969/j.issn.1000-3428.2009.14.026

• Software Technology and Database •

Information Extraction Based on Restricted Tree Edit Distance and Navigate Tree

JIANG Bo (姜 波), DING Yue-wei (丁岳伟)

  1. (School of Computer & Electrical Engineering, University of Shanghai for Science & Technology, Shanghai 200093)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-07-20 Published:2009-07-20

Abstract: This paper introduces information-collection (crawling) algorithms based on Web site structure and Web page structure, and proposes a navigate tree algorithm based on the restricted tree edit distance. The algorithm extracts the key HTML tags of a page to build a labeled tree (an ordered, labeled, rooted tree) that represents the page structure, and uses the restricted tree edit distance to judge whether a crawled page is relevant to the topic. Using the URL-based topology of the Web site, a navigate tree then constrains the crawling path of the collector. Together, these steps improve the efficiency and accuracy of target-page collection.

Key words: labeled tree, tree edit distance, navigate tree
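
The page-structure comparison described in the abstract can be illustrated with a minimal Python sketch. The abstract does not define the paper's exact constraints, so this example assumes a simplified top-down variant (in the style of Selkow's algorithm) in which only whole subtrees are inserted or deleted and a node is relabeled at unit cost. The names `KEY_TAGS`, `Node`, `build_tag_tree`, `top_down_distance`, and `structural_similarity`, the choice of key tags, and the normalization are all hypothetical and are not taken from the paper.

```python
from html.parser import HTMLParser

# Assumption: an illustrative set of "key" structural tags; the paper's list is not given.
KEY_TAGS = {"html", "body", "div", "table", "tr", "td", "ul", "li", "p", "h1", "h2"}


class Node:
    """A node of an ordered, labeled, rooted tag tree."""

    def __init__(self, label):
        self.label = label
        self.children = []

    def size(self):
        return 1 + sum(child.size() for child in self.children)


class TagTreeBuilder(HTMLParser):
    """Builds a tag tree from the key tags only, ignoring text and other tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        if tag in KEY_TAGS:
            node = Node(tag)
            self.stack[-1].children.append(node)
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag not in KEY_TAGS:
            return
        # Pop back to the nearest matching open tag, tolerating unclosed tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].label == tag:
                del self.stack[i:]
                break


def build_tag_tree(html):
    builder = TagTreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def top_down_distance(a, b):
    """Top-down edit distance: relabel a node (cost 1) or insert/delete
    a whole child subtree (cost = number of nodes in that subtree)."""
    relabel = 0 if a.label == b.label else 1
    m, n = len(a.children), len(b.children)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = d[i - 1][0] + a.children[i - 1].size()
    for j in range(1, n + 1):
        d[0][j] = d[0][j - 1] + b.children[j - 1].size()
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + a.children[i - 1].size(),   # delete subtree of a
                d[i][j - 1] + b.children[j - 1].size(),   # insert subtree of b
                d[i - 1][j - 1]
                + top_down_distance(a.children[i - 1], b.children[j - 1]),
            )
    return relabel + d[m][n]


def structural_similarity(a, b):
    """Hypothetical normalization to [0, 1]; 1.0 means identical structure."""
    return 1.0 - top_down_distance(a, b) / (a.size() + b.size())


page_a = "<html><body><div><p>x</p><p>y</p></div></body></html>"
page_b = "<html><body><div><p>x</p></div><table><tr><td>z</td></tr></table></body></html>"
tree_a, tree_b = build_tag_tree(page_a), build_tag_tree(page_b)
print(top_down_distance(tree_a, tree_b), structural_similarity(tree_a, tree_b))
```

In a crawler, a page whose similarity to a known target-page template exceeds some threshold could be treated as relevant; the threshold and the template are further assumptions that would have to come from the full paper.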

CLC Number: