Distribution Design of Web Crawler

doi:10.3969/j.issn.1000-3428.2009.04.037

Computer Engineering ›› 2009, Vol. 35 ›› Issue (4): 105-107.

Distribution Design of Web Crawler

LI Wei-jiang1, ZHAO Tie-jun2, PIAO Xing-hai2

(1. Key Lab of Computer Application, Kunming University of Science and Technology, Kunming 650051; 2. School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001)

Received:1900-01-01 Revised:1900-01-01 Online:2009-02-20 Published:2009-02-20

Abstract

Abstract: At the current scale of the Internet, a single-machine Web crawler cannot crawl the entire Web within a useful time frame. This paper addresses the problem with a distributed Web crawler. The distributed design considers two levels of parallelism: multiple threads running in parallel inside each node, and distributed parallelism across nodes. The paper focuses on parallelism across nodes and addresses two issues for the distributed crawler: the choice of crawl strategy and dynamic configurability. The design goals are that a single node in the distributed crawler should not perform much worse than a standalone crawler, and that the load is balanced while the system's communication and management overhead is kept as low as possible. Experimental results show that a hash function based on the Web site (site hashing) largely achieves these goals.

Key words: Web crawler, distribution, multi-thread
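
The abstract names site hashing as the partitioning scheme but does not give the hash function. The following is a minimal sketch of the general idea, not the authors' implementation: every URL is assigned to a node by hashing only its host name, so all pages of one site land on the same node. The function names `node_for_url` and `partition`, the use of MD5, and the modulo mapping are assumptions made for illustration.

```python
import hashlib
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Pick a crawler node for a URL by hashing only its host name.

    Assumption: MD5 is used purely as a stable, well-mixed hash;
    the paper's actual hash function is not stated in the abstract.
    """
    # urlsplit().hostname is already lower-cased; fall back to "" if absent.
    host = urlsplit(url).hostname or ""
    digest = hashlib.md5(host.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer and map it onto the nodes.
    return int.from_bytes(digest[:8], "big") % num_nodes


def partition(urls, num_nodes):
    """Group URLs into one bucket per node (hypothetical helper)."""
    buckets = {n: [] for n in range(num_nodes)}
    for url in urls:
        buckets[node_for_url(url, num_nodes)].append(url)
    return buckets


if __name__ == "__main__":
    seeds = [
        "http://example.com/index.html",
        "http://example.com/about.html",
        "http://example.org/",
        "http://example.net/news?id=7",
    ]
    for node, assigned in partition(seeds, 3).items():
        print(node, assigned)
```

Because the path and query string are ignored, a link that stays within a site never has to be forwarded to another node; only cross-site links do. That is one plausible reason a site-level hash keeps inter-node communication low, at the cost of uneven load when a few sites are much larger than the rest.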

CLC number: TP393

LI Wei-jiang; ZHAO Tie-jun; PIAO Xing-hai. Distribution Design of Web Crawler[J]. Computer Engineering, 2009, 35(4): 105-107.

https://www.ecice06.com/CN/Y2009/V35/I4/105
