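Summary: To improve the efficiency of crawling and mining massive Web data, a parallel cluster crawling mechanism is introduced. So that the capability of every compute node in the cluster is fully used, vector measurement techniques are applied to the problem of matching crawling tasks to compute-node capability. The crawling task vector and the compute node vector are defined, a cosine vector matching algorithm is proposed, and the related parallel algorithms are described. Theoretical analysis and experiments show that the mining task allocation model based on the cosine vector matching algorithm has good allocation adaptability and load balance.
Keywords:
parallel crawling,
cosine vector method,
computer cluster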
Abstract: This paper proposes a parallel cluster crawling model to improve the efficiency of mining massive Web data. To make full use of the capability of each node in the computer cluster, a vector measurement technique is introduced to solve the problem of matching crawling tasks to computer nodes. After the crawling task vector and the computer node vector are defined, a cosine vector similarity formula is described and the parallel crawling algorithms are designed. Experimental results show that the system is effective in distribution adaptability and load balance.
Key words:
parallel crawling,
cosine vector,
computer cluster
CLC number:
徐文杰;陈庆奎. 基于余弦向量法的Web数据并行抓掘系统[J]. 计算机工程, 2009, 35(7): 64-67.
XU Wen-jie; CHEN Qing-kui. Parallel Crawling System for Web Data Based on Cosine Vector[J]. Computer Engineering, 2009, 35(7): 64-67.
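
The matching idea named in the abstract, comparing a crawling task vector with a computer node vector by the cosine of the angle between them, can be illustrated with a minimal sketch. The abstract does not give the paper's vector definitions or its allocation algorithm, so everything concrete here is an assumption: the three vector components (read loosely as CPU, memory, and bandwidth demand or capacity), the identifiers `t1`, `n1`, and so on, and the greedy "pick the most similar node" rule, which by itself performs no load balancing.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # A zero vector has no direction; treat it as matching nothing.
        return 0.0
    return dot / (norm_a * norm_b)


def match_tasks_to_nodes(tasks, nodes):
    """Assign each task to the node whose vector is most similar to it.

    Hypothetical greedy rule for illustration only; it ignores how many
    tasks a node has already received.
    """
    assignment = {}
    for task_id, task_vec in tasks.items():
        assignment[task_id] = max(
            nodes,
            key=lambda node_id: cosine_similarity(task_vec, nodes[node_id]),
        )
    return assignment


# Hypothetical vectors: (CPU, memory, bandwidth).
tasks = {"t1": [8.0, 1.0, 1.0], "t2": [1.0, 1.0, 8.0]}
nodes = {"n1": [4.0, 1.0, 1.0], "n2": [1.0, 2.0, 6.0]}
assignment = match_tasks_to_nodes(tasks, nodes)
```

Because cosine similarity depends only on direction, a CPU-heavy task such as `t1` is paired with the CPU-heavy node `n1` regardless of the absolute magnitudes of the two vectors. A scheduler aiming at the load balance mentioned in the abstract would also need to account for each node's current load, which this sketch leaves out.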