Distribution Design of Web Crawler

doi:10.3969/j.issn.1000-3428.2009.04.037

Computer Engineering, 2009, Vol. 35, Issue (4): 105-107.

Networks and Communications

LI Wei-jiang1, ZHAO Tie-jun2, PIAO Xing-hai2

(1. Key Lab of Computer Application, Kunming University of Science and Technology, Kunming 650051; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)

Online: 2009-02-20 Published: 2009-02-20

Abstract

At the current scale of the Internet, a single Web crawler cannot visit the entire Web within a useful time frame. This paper develops a distributed Web crawler system to address the problem. The design considers two levels of parallelism: multiple threads within each node, and distributed parallelism among the nodes. It focuses on the latter and addresses two issues for a distributed Web crawler: the crawl strategy and dynamic configuration. Experimental results show that a hash function based on the Web site largely achieves the goals of the distributed design. Those goals are that a node in the distributed crawler should not perform much worse than a standalone crawler, and that communication and management costs should be kept as low as possible while the system pursues load balance.

Key words: Web crawler, distribution, multi-thread
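
The site-based hashing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's actual hash function or node protocol, so everything below is an assumption made for illustration: the helper names `site_of`, `node_for`, and `partition` are hypothetical, the "site" is taken to be the URL's host name, and SHA-256 stands in for whatever hash the authors used.

```python
import hashlib
from urllib.parse import urlsplit


def site_of(url: str) -> str:
    """Return the host name of a URL, used here as the 'site' key (assumption)."""
    host = urlsplit(url).hostname
    if host is None:
        raise ValueError(f"URL has no host: {url!r}")
    return host.lower()


def node_for(url: str, num_nodes: int) -> int:
    """Map a URL to a node index by hashing its site rather than the full URL."""
    digest = hashlib.sha256(site_of(url).encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce modulo node count.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one queue per node according to node_for."""
    queues = [[] for _ in range(num_nodes)]
    for url in urls:
        queues[node_for(url, num_nodes)].append(url)
    return queues


if __name__ == "__main__":
    seeds = [
        "http://a.test/index.html",
        "http://a.test/news/1",
        "http://b.test/",
    ]
    for index, queue in enumerate(partition(seeds, 3)):
        print(index, queue)
```

Under this scheme every page of a site lands on the same node, so links that stay within a site never have to be forwarded to another node. That is one plausible reason a site-level hash keeps communication costs low. Note that a plain modulo mapping reassigns most sites when `num_nodes` changes; how the paper handles dynamic configuration is not stated in the abstract.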

CLC Number: TP393

LI Wei-jiang; ZHAO Tie-jun; PIAO Xing-hai. Distribution Design of Web Crawler[J]. Computer Engineering, 2009, 35(4): 105-107.

URL: https://www.ecice06.com/EN/Y2009/V35/I4/105

