Abstract:
At the current scale of the Internet, a single Web crawler cannot visit the entire Web within a useful time frame. This paper develops a distributed Web crawler system to address this problem. The distributed design considers two levels of parallelism: multi-threading within each node, and distributed parallelism among the nodes. The paper focuses on parallelism among nodes and addresses two issues of the distributed Web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the Web site largely achieves the goals of the distributed design: the performance of a single node in the distributed crawler should not fall far below that of a standalone crawler, and, while the system pursues load balance, communication and management costs are kept as low as possible.
Keywords: Web crawler, distribution, multi-thread
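The site-based hashing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the hash function or any interface, so everything here is an assumption for illustration: the names `node_for_url` and `partition` are hypothetical, SHA-256 stands in for whatever hash the authors used, and the node count is fixed. The point it shows is that hashing the host name, rather than the full URL, keeps all pages of one site on one node, so links within a site never need to be sent to another node.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Return the index of the node assumed responsible for `url`.

    Hypothetical helper: only the host name is hashed, so every URL
    of the same Web site maps to the same node.
    """
    host = (urlsplit(url).hostname or "").lower()
    # SHA-256 is an assumption; any stable, well-mixed hash would do.
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one list per node using the site hash."""
    buckets = [[] for _ in range(num_nodes)]
    for url in urls:
        buckets[node_for_url(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/index.html",
        "http://example.com/about",
        "http://example.org/",
    ]
    for node_id, bucket in enumerate(partition(seeds, 4)):
        print(node_id, bucket)
```

A plain modulo over the node count is the simplest choice but remaps most sites when a node joins or leaves; a scheme such as consistent hashing would be one way to support dynamic configuration, though the abstract does not say which approach the paper takes.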
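Abstract (translated from Chinese): A single-machine Web crawler can no longer collect the entire Web within a useful time frame. This paper uses a distributed Web crawler to solve the problem. The distributed design mainly considers the parallelism of multiple threads within a node and the distributed parallelism among nodes, covering two aspects of the distributed Web crawler: strategy selection and dynamic configurability. Experimental results show that site hashing largely achieves the goals of the distributed design, minimizing the system's communication and management overhead while pursuing load balance.
Keywords: Web crawler, distributed, multi-threading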
LI Wei-jiang; ZHAO Tie-jun; PIAO Xing-hai. Distribution Design of Web Crawler[J]. Computer Engineering, 2009, 35(4): 105-107.
李卫疆;赵铁军;朴星海. 网络爬行器的分布式设计[J]. 计算机工程, 2009, 35(4): 105-107.