
Computer Engineering, 2009, Vol. 35, Issue (2): 34-36. doi: 10.3969/j.issn.1000-3428.2009.02.012

• Software Technology and Database •

Study on Parallel Crawler Based on Pipeline Load Balancing Model

MENG Xiang-qian, YE Yun-ming, DENG Bin

  1. (Shenzhen Graduate School, Harbin Institute of Technology, Shenzhen 518055)
  • Online: 2009-01-20 Published: 2009-01-20

Abstract: To address the load balancing problem among modules when a parallel crawling system executes multiple tasks concurrently, this paper proposes a Pipeline Load Balancing (PLB) model. PLB abstracts the different tasks as independent modules and balances them so that every module processes at the same speed. A PLB-based parallel crawler is implemented with multiple threads, and the number of threads is adjusted dynamically according to how long threads sleep while waiting and how the buffers change. Experimental results show that the approach runs efficiently and stably.

Key words: crawler, parallel, pipeline, load balancing
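
The abstract does not give the adjustment rule itself, so the following is only a minimal sketch of the general idea, not the authors' implementation. It replaces real threads with a deterministic tick-based simulation. The stage names (`fetch`, `parse`, `store`), the per-thread rates, the arrival rate, and the grow/shrink rule are all assumptions made for illustration: a stage gains a thread when its input buffer has grown since the previous tick, and loses one when at least one thread's worth of capacity went unused (a stand-in for a sleeping thread).

```python
from dataclasses import dataclass


@dataclass
class Stage:
    name: str
    rate: int              # items one thread handles per tick (hypothetical figure)
    threads: int = 1
    backlog: int = 0       # items waiting in this stage's input buffer
    prev_backlog: int = 0  # backlog left over at the end of the previous tick
    done: int = 0          # items processed in the most recent tick

    def step(self, incoming: int) -> int:
        """Run one tick and return the number of items passed downstream."""
        self.backlog += incoming
        capacity = self.threads * self.rate
        self.done = min(self.backlog, capacity)
        self.backlog -= self.done
        spare = capacity - self.done
        if self.backlog > self.prev_backlog:
            # Input buffer is growing: this stage is slower than its producer.
            self.threads += 1
        elif spare >= self.rate and self.threads > 1:
            # A whole thread had nothing to do this tick (it would have slept).
            self.threads -= 1
        self.prev_backlog = self.backlog
        return self.done


def simulate(stages, arrivals_per_tick, ticks):
    """Feed a constant stream into the first stage and chain the stages."""
    for _ in range(ticks):
        flow = arrivals_per_tick
        for stage in stages:
            flow = stage.step(flow)
    return stages


if __name__ == "__main__":
    pipeline = [Stage("fetch", 2), Stage("parse", 5), Stage("store", 3)]
    simulate(pipeline, arrivals_per_tick=10, ticks=20)
    for s in pipeline:
        print(s.name, "threads:", s.threads, "items/tick:", s.done)
```

With these made-up rates the thread counts settle at 5, 2 and 4, and every stage ends up handling 10 items per tick, which is the "equal processing speed" condition the abstract describes. A real crawler would use worker threads and bounded queues in place of the counters, and would need some damping so that thread counts do not oscillate under bursty load.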
