Research and Implementation of Distributed and Multi-topic Web Crawler System

doi:10.3969/j.issn.1000-3428.2009.19.005

Computer Engineering ›› 2009, Vol. 35 ›› Issue (19): 13-16,1.

• Degree Paper •

BAI He1,2, TANG Di-bin1,2, WANG Jin-lin2

(1. Graduate University of Chinese Academy of Sciences, Beijing 100039; 2. National Network New Media Technology Engineering Center, Institute of Acoustics, Chinese Academy of Sciences, Beijing 100190)

Received:1900-01-01 Revised:1900-01-01 Online:2009-10-05 Published:2009-10-05

Original title (Chinese): 分布式多主题网络爬虫系统的研究与实现

Authors (Chinese): 白鹤, 汤迪斌, 王劲林

Abstract: This paper proposes a distributed Web crawler architecture based on a data extractor. The architecture adopts a multi-topic strategy based on classification labeling, so that a single crawler system can adaptively accommodate multiple topics. It introduces a two-tier weighted task partition algorithm that assigns URLs in a target-guided, load-balanced manner according to each Agent's load, which improves the system's dynamic scalability. It also presents an improved Trie-based URL storage strategy that efficiently supports URL lookup, insertion, and duplicate detection.

Key words: Web crawler, multi-topic, distributed
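The abstract does not describe the details of the authors' improved Trie structure, so the following is only a minimal sketch of the general idea: a plain character-level trie that supports URL insertion, lookup, and duplicate detection. The class name `UrlTrie`, the `_key` normalization (lower-casing the scheme and host and dropping the fragment), and the dictionary-based node layout are all assumptions made for illustration, not the paper's design.

```python
from urllib.parse import urlsplit


class UrlTrie:
    """Character-level trie storing URLs (hypothetical illustration)."""

    _END = ""  # end-of-URL marker; cannot collide with a one-character key

    def __init__(self):
        self._root = {}
        self._size = 0

    @staticmethod
    def _key(url):
        # Assumed normalization: lower-case scheme and host, drop the fragment.
        parts = urlsplit(url.strip())
        path = parts.path or "/"
        query = "?" + parts.query if parts.query else ""
        return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}{query}"

    def insert(self, url):
        """Insert url; return True if it was new, False if it is a duplicate."""
        node = self._root
        for ch in self._key(url):
            node = node.setdefault(ch, {})
        if self._END in node:
            return False
        node[self._END] = True
        self._size += 1
        return True

    def __contains__(self, url):
        """Return True only if this exact (normalized) URL was inserted."""
        node = self._root
        for ch in self._key(url):
            node = node.get(ch)
            if node is None:
                return False
        return self._END in node

    def __len__(self):
        return self._size


if __name__ == "__main__":
    seen = UrlTrie()
    for u in ["http://Example.com/a#top", "http://example.com/a"]:
        print(u, "new" if seen.insert(u) else "duplicate")
```

Because URLs from the same site share long prefixes, a trie stores those prefixes once, and both insertion and the duplicate check cost time proportional to the URL length rather than to the number of stored URLs.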

CLC Number: TP393

BAI He; TANG Di-bin; WANG Jin-lin. Research and Implementation of Distributed and Multi-topic Web Crawler System[J]. Computer Engineering, 2009, 35(19): 13-16,1.

URL: https://www.ecice06.com/EN/Y2009/V35/I19/13
