Computer Engineering ›› 2009, Vol. 35 ›› Issue (7): 64-67. doi: 10.3969/j.issn.1000-3428.2009.07.021

• Software Technology and Database •

Parallel Crawling System for Web Data Based on Cosine Vector

XU Wen-jie (徐文杰), CHEN Qing-kui (陈庆奎)

  1. (School of Computer and Electrical Engineering, University of Shanghai for Science and Technology, Shanghai 200093)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-04-05 Published: 2009-04-05

Abstract: To improve the efficiency of crawling and mining massive Web data, this paper introduces a parallel cluster crawling mechanism. To make full use of the capability of every compute node in the cluster, vector measurement is applied to the problem of matching crawling tasks to node capabilities. The crawling task vector and the compute node vector are defined, a cosine vector matching algorithm is proposed, and the related parallel algorithms are described. Theoretical analysis and experiments show that the task allocation model based on the cosine vector matching algorithm achieves good allocation adaptability and load balance.

Key words: parallel crawling, cosine vector, computer cluster
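
The abstract does not give the vector components or the matching procedure, so the following is only a minimal sketch of what cosine-based matching between task vectors and node vectors might look like. The three dimensions (CPU, network bandwidth, memory), the node and task names, and the greedy "pick the most similar node" rule are all assumptions made for illustration, not the authors' definitions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # a zero vector has no direction; treat it as no match
    return dot / (norm_u * norm_v)


def match_tasks_to_nodes(tasks, nodes):
    """Assign each task to the node whose vector is most similar to it.

    tasks, nodes: dicts mapping a name to a vector.
    Ties go to the node that appears first in `nodes`.
    """
    assignment = {}
    for task_name, task_vec in tasks.items():
        best_node = max(nodes, key=lambda n: cosine_similarity(task_vec, nodes[n]))
        assignment[task_name] = best_node
    return assignment


# Hypothetical dimensions: (CPU, network bandwidth, memory)
nodes = {
    "node_a": (8.0, 1.0, 2.0),  # compute-oriented node
    "node_b": (1.0, 9.0, 2.0),  # bandwidth-oriented node
}
tasks = {
    "parse_heavy": (6.0, 1.0, 1.0),
    "download_heavy": (1.0, 7.0, 1.0),
}
assignment = match_tasks_to_nodes(tasks, nodes)
print(assignment)
```

Cosine similarity compares only the direction of the two vectors, so a task is matched to a node with a similar resource profile regardless of absolute scale. This sketch assigns each task independently and does not track how much work a node has already received; the load balancing mentioned in the abstract would need additional bookkeeping that is not described here.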
