Computer Engineering, 2019, Vol. 45, Issue (11): 62-67. doi: 10.19678/j.issn.1000-3428.0053439

• Advanced Computing and Data Processing •

An Efficient Load Balance Strategy for Distributed Crawler System

ZHANG Shutao1,2, TAN Haibo1, CHEN Liangfeng1, LÜ Bo1

  1. Hefei Institutes of Physical Science, Chinese Academy of Sciences, Hefei 230039, China;
  2. Graduate School, University of Science and Technology of China, Hefei 230039, China
  • Received: 2018-12-19 Revised: 2019-01-23 Published: 2019-11-15 Online: 2019-02-22
  • About the authors: ZHANG Shutao (b. 1995), male, master's student, main research interests: data mining and network security; TAN Haibo, research fellow, Ph.D.; CHEN Liangfeng, engineer, Ph.D.; LÜ Bo, engineer, M.S.
  • Funding:
    Anhui Province Science and Technology Major Project "Big-Data-Based Precision Intelligence Service Platform for Micro, Small and Medium-Sized Enterprises" (711245801052).

Abstract: Traditional load balancing methods for distributed crawler systems consider only a few of the factors that affect load and do not evaluate the load of each crawler node comprehensively, so tasks are assigned unreasonably. To address this problem, this paper proposes an efficient load balancing strategy for distributed crawler systems. The factors that affect the running time of a crawler node are analyzed, and a back-propagation (BP) neural network is used to build a non-linear, multi-factor model of the running time of distributed crawler nodes. The minimum variance of the running times that this model predicts for the sub-nodes is taken as the objective function of the load balancing strategy, and an improved particle swarm optimization algorithm with constraints is used to solve it, which determines the task assignment scheme. Experimental results show that the strategy effectively shortens the running time of the distributed crawler system while meeting the high-performance requirements of the crawler nodes.

Key words: distributed crawler, load balance, prediction model, particle swarm optimization algorithm, constraint condition
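
The optimization step described in the abstract (minimize the variance of the predicted per-node running times, subject to constraints on the task assignment) can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration: `predict_runtime` is a hypothetical stand-in for the paper's BP neural network model, the `speeds` parameter is an invented single load factor, task counts are treated as continuous values, and the solver is a plain particle swarm with a simple repair step rather than the paper's improved variant.

```python
import random
import statistics


def predict_runtime(tasks, speed):
    # Hypothetical stand-in for the BP-network runtime model:
    # runtime grows slightly faster than linearly with the task count.
    return (tasks ** 1.1) / speed


def objective(alloc, speeds):
    # Variance of the predicted running times across nodes (to be minimized).
    times = [predict_runtime(t, s) for t, s in zip(alloc, speeds)]
    return statistics.pvariance(times)


def repair(position, total):
    # Enforce the constraints: no negative task counts, counts sum to `total`.
    clipped = [max(x, 0.0) for x in position]
    s = sum(clipped)
    if s == 0:
        return [total / len(position)] * len(position)
    return [x * total / s for x in clipped]


def pso_allocate(speeds, total, particles=30, iters=200, seed=0):
    rng = random.Random(seed)
    n = len(speeds)
    w, c1, c2 = 0.7, 1.5, 1.5  # inertia and acceleration coefficients (assumed)
    pos = [repair([rng.random() for _ in range(n)], total)
           for _ in range(particles)]
    vel = [[0.0] * n for _ in range(particles)]
    pbest = [p[:] for p in pos]
    pbest_val = [objective(p, speeds) for p in pos]
    g = min(range(particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]

    for _ in range(iters):
        for i in range(particles):
            for d in range(n):
                r1, r2 = rng.random(), rng.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
            # Move the particle, then project it back onto the feasible set.
            pos[i] = repair([x + v for x, v in zip(pos[i], vel[i])], total)
            val = objective(pos[i], speeds)
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val


if __name__ == "__main__":
    speeds = [1.0, 2.0, 4.0]  # hypothetical relative node speeds
    alloc, variance = pso_allocate(speeds, total=700)
    print("tasks per node:", [round(t, 1) for t in alloc])
    print("variance of predicted running times:", variance)
```

In this sketch the constraint is handled by repairing each particle after every move, which keeps every candidate assignment feasible; penalty terms in the objective are a common alternative, and the paper may handle its constraints differently.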

CLC Number: