Abstract:
This paper proposes an architecture for a distributed Web crawler system based on a data extractor. The architecture adopts a multi-topic scheme based on classification labels, so that a single crawler system can adaptively accommodate different topics. It introduces a two-tier weighted task-partitioning algorithm that assigns URLs in a target-guided way according to the Agents' load, which also improves dynamic scalability. Finally, it improves URL storage with a Trie, which efficiently supports URL lookup, insertion, and duplicate detection.
Key words: Web crawler, multi-topic, distributed
摘要: 提出一种基于数据抽取器的分布式爬虫架构。该架构采用基于分类标注的多主题策略,解决同一爬虫系统内多主题自适应兼容的问题。介绍二级加权任务分割算法,解决基于目标导向、负载均衡的URL分配问题,增强系统可扩展性。给出基于Trie树的URL存储策略的改进方法,可以高效地支持URL查询、插入和重复性检测。
关键词: 网络爬虫, 多主题, 分布式
BAI He; TANG Di-bin; WANG Jin-lin. Research and Implementation of Distributed and Multi-topic Web Crawler System[J]. Computer Engineering, 2009, 35(19): 13-16,1.
白 鹤;汤迪斌;王劲林. 分布式多主题网络爬虫系统的研究与实现[J]. 计算机工程, 2009, 35(19): 13-16,1.
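
Illustrative note: the abstract states that URL storage is built on a Trie supporting lookup, insertion, and duplicate detection, but gives no implementation details. The following is a minimal sketch of that general idea only, not the authors' improved structure. The class name `URLTrie`, the character-level node layout, and the normalization rule (lowercase scheme and host, drop the fragment) are all assumptions made for illustration.

```python
from urllib.parse import urlsplit

# Sentinel key marking "a stored URL ends here". The empty string can never
# collide with a single-character edge label.
_END = ""


class URLTrie:
    """Character-level trie over normalized URLs (hypothetical sketch)."""

    def __init__(self):
        self.root = {}
        self.size = 0

    @staticmethod
    def normalize(url):
        # Assumed normalization: lowercase scheme and host, drop the fragment,
        # and treat an empty path as "/".
        parts = urlsplit(url)
        path = parts.path or "/"
        query = "?" + parts.query if parts.query else ""
        return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

    def insert(self, url):
        """Insert a URL; return True if it was new, False if a duplicate."""
        node = self.root
        for ch in self.normalize(url):
            node = node.setdefault(ch, {})
        if node.get(_END):
            return False
        node[_END] = True
        self.size += 1
        return True

    def contains(self, url):
        """Return True only if this exact (normalized) URL was inserted."""
        node = self.root
        for ch in self.normalize(url):
            node = node.get(ch)
            if node is None:
                return False
        return bool(node.get(_END))
```

Because `insert` reports whether the URL was already present, a crawler could use its return value as the duplicate check before queueing a link, so lookup and insertion share one traversal whose cost grows with the URL length rather than with the number of stored URLs.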