
Computer Engineering, 2009, Vol. 35, Issue (19): 13-16,1. doi: 10.3969/j.issn.1000-3428.2009.19.005

• Doctoral Thesis •

Research and Implementation of Distributed and Multi-topic Web Crawler System

BAI He1,2, TANG Di-bin1,2, WANG Jin-lin2

(1. Graduate University of Chinese Academy of Sciences, Beijing 100039; 2. National Network New Media Technology Engineering Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190)

• Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-10-05 Published: 2009-10-05

Abstract: This paper proposes a distributed Web crawler architecture based on data extractors. The architecture adopts a multi-topic strategy based on classification labels, so that multiple topics can coexist adaptively within a single crawler system. It introduces a two-tier weighted task partition algorithm that assigns URLs in a target-guided manner according to each Agent's load, which improves the system's scalability. It also presents an improved Trie-based URL storage strategy that efficiently supports URL lookup, insertion, and duplicate detection.

Key words: Web crawler, multi-topic, distributed
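
The abstract names a two-tier weighted task partition algorithm but does not spell it out. As a rough illustration only, the sketch below shows one hypothetical reading of "target-guided, load-balanced" URL assignment: tier one narrows the Agents by topic affinity, tier two picks the least loaded of them relative to capacity. The `Agent` fields, the weights, and the `assign_url` function are all assumptions made for this example and are not taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Agent:
    name: str
    topics: dict[str, float]  # hypothetical topic-affinity weights (tier one)
    capacity: float           # hypothetical relative capacity (tier two)
    pending: int = 0          # URLs currently queued on this agent


def assign_url(url_topic: str, agents: list[Agent]) -> Agent:
    """Pick an agent for a URL labelled with url_topic (illustrative only)."""
    # Tier one (target-guided): keep the agents with the highest affinity
    # for the URL's topic. If no agent knows the topic, all agents tie at 0.
    best = max(a.topics.get(url_topic, 0.0) for a in agents)
    candidates = [a for a in agents if a.topics.get(url_topic, 0.0) == best]
    # Tier two (load balancing): choose the candidate whose queue is
    # shortest relative to its capacity.
    chosen = min(candidates, key=lambda a: a.pending / a.capacity)
    chosen.pending += 1
    return chosen


agents = [
    Agent("a1", {"news": 1.0}, capacity=2.0),
    Agent("a2", {"news": 1.0}, capacity=1.0),
    Agent("a3", {"sports": 1.0}, capacity=1.0),
]
picked = [assign_url("news", agents).name for _ in range(3)]
```

In this toy setup the two "news" Agents share the news URLs in proportion to their capacity, and the "sports" Agent receives none of them. Adding a new Agent only requires appending it to the list, which is one way such a scheme could support dynamic scaling.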

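
To make the Trie-based URL storage mentioned in the abstract concrete, here is a minimal sketch of a plain character-level trie that supports the three operations the abstract lists: lookup, insertion, and duplicate detection. The abstract does not describe the paper's improvements, so this is the textbook structure rather than the authors' variant; the class name `UrlTrie` and its methods are assumptions made for illustration.

```python
class UrlTrie:
    """Plain character-level trie over URL strings (not the paper's variant)."""

    # Sentinel key marking the end of a stored URL. It is the empty string,
    # so it can never collide with a single-character edge label.
    _END = ""

    def __init__(self):
        self._root = {}
        self._size = 0

    def insert(self, url: str) -> bool:
        """Store url. Return True if it was new, False if it was a duplicate."""
        node = self._root
        for ch in url:
            node = node.setdefault(ch, {})
        if self._END in node:
            return False
        node[self._END] = True
        self._size += 1
        return True

    def __contains__(self, url: str) -> bool:
        """Return True only if url was inserted as a whole, not just as a prefix."""
        node = self._root
        for ch in url:
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

    def __len__(self) -> int:
        return self._size


seen = UrlTrie()
first = seen.insert("http://example.com/a")
second = seen.insert("http://example.com/a")  # duplicate, rejected
```

Because URLs from the same site share long prefixes, a trie stores those prefixes once, and a single walk down the tree answers both "is this URL known?" and "insert it if not", each in time proportional to the URL's length.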