
Computer Engineering ›› 2009, Vol. 35 ›› Issue (4): 105-107. doi: 10.3969/j.issn.1000-3428.2009.04.037

• Networks and Communications •

Distributed Design of a Web Crawler

LI Wei-jiang (李卫疆)1, ZHAO Tie-jun (赵铁军)2, PIAO Xing-hai (朴星海)2

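  1. (1. Provincial Key Lab of Computer Application, Kunming University of Science and Technology, Kunming 650051; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-02-20 Published:2009-02-20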

Distribution Design of Web Crawler

LI Wei-jiang1, ZHAO Tie-jun2, PIAO Xing-hai2   

  1. (1. Key Lab of Computer Application, Kunming University of Science and Technology, Kunming 650051; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-02-20 Published:2009-02-20

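Abstract: A single-machine Web crawler can no longer complete a crawl of the entire Web within a useful time frame. This paper addresses the problem with a distributed Web crawler. The distributed design considers both parallelism among multiple threads inside a node and distributed parallelism between nodes, and covers two aspects: the choice of strategy for the distributed crawler and its dynamic configurability. Experimental results show that the site-hashing method largely meets the goals of the distributed design, keeping the system's communication and management overhead to a minimum while pursuing load balance.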

Key words: Web crawler, distributed, multi-thread

Abstract: At the current scale of the Internet, a single-machine Web crawler cannot visit the entire Web within a useful time frame. This paper develops a distributed Web crawler system to address the problem. The distributed design considers two levels of parallelism: multiple threads within each node, and distributed parallelism among the nodes. It focuses on the latter and addresses two issues of the distributed Web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the Web site achieves the goals of the distributed design. The performance of a single node in the distributed crawler should not fall far below that of a standalone crawler. While pursuing load balance, the system keeps communication and management costs as low as possible.

Key words: Web crawler, distribution, multi-thread
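
The site-based hashing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the hash function or any interface, so everything here is an assumption made for illustration: the helper names `node_for_url` and `partition` are hypothetical, and MD5 over the lowercased host name stands in for whatever hash the authors actually used.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node index by hashing its host name.

    Hypothetical helper: the paper's actual hash function is not given.
    """
    host = (urlsplit(url).hostname or "").lower()
    # Use a stable digest rather than Python's built-in hash(), which is
    # salted per process and would give different answers on different nodes.
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one list per node according to node_for_url."""
    buckets = [[] for _ in range(num_nodes)]
    for url in urls:
        buckets[node_for_url(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/index.html",
        "http://example.com/about.html",
        "http://example.org/",
    ]
    for node, bucket in enumerate(partition(seeds, 4)):
        print(node, bucket)
```

Under this scheme every URL of a given site lands on the same node, so a link that stays within a site never has to be forwarded to another node; only cross-site links would need inter-node communication. That property is one plausible reading of how site hashing keeps communication overhead low, though the abstract itself does not spell out the mechanism.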
