
Computer Engineering ›› 2013, Vol. 39 ›› Issue (6): 43-46. doi: 10.3969/j.issn.1000-3428.2013.06.008

• Advanced Computing and Data Processing •

Parallel Web Log Mining Algorithm in Hadoop Platform

ZHOU Shi-hui (周诗慧), YIN Jian (殷建)

  1. (School of Mechanical, Electrical & Information Engineering, Shandong University at Weihai, Weihai 264209, China)
  • Received: 2012-07-23 Online: 2013-06-15 Published: 2013-06-14
  • About the authors: ZHOU Shi-hui (born 1987), female, master's student, main research interests: data mining and machine learning; YIN Jian, associate professor

Abstract: When faced with massive data, Web data mining on a single node runs into bottlenecks in both time and space efficiency. To address this problem, this paper proposes a parallel FP-growth algorithm for Web log mining on the Hadoop platform, which processes log files using the Hadoop Distributed File System (HDFS) and the MapReduce parallel computing model. Experimental results on datasets of different sizes show that the speedup of the algorithm increases as the dataset grows, and that it executes more efficiently than the serial FP-growth algorithm.

Key words: Hadoop framework, Web mining, Web log, MapReduce programming model, Hadoop Distributed File System (HDFS), parallel FP-growth algorithm
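
The abstract does not spell out the individual steps of the algorithm, so the following is only a minimal sketch of how one stage of FP-growth, counting item support, can be phrased as a map step and a reduce step. It simulates the map, shuffle and reduce phases in a single Python process rather than on Hadoop, and it is not the authors' implementation. The function names (`map_session`, `reduce_counts`, `frequent_pages`), the session format and the sample data are all assumptions made for illustration.

```python
from itertools import groupby
from operator import itemgetter


def map_session(session):
    """Map step: emit (page, 1) once per distinct page in one session."""
    for page in set(session):
        yield page, 1


def reduce_counts(page, counts):
    """Reduce step: sum the partial counts emitted for one page."""
    return page, sum(counts)


def frequent_pages(sessions, min_support):
    """Simulate map -> shuffle -> reduce in-process and keep frequent pages."""
    # Map: each session is handled independently, as it would be on separate nodes.
    pairs = [kv for session in sessions for kv in map_session(session)]
    # Shuffle: bring all pairs with the same key together.
    pairs.sort(key=itemgetter(0))
    grouped = groupby(pairs, key=itemgetter(0))
    # Reduce: total the counts per page.
    totals = dict(
        reduce_counts(page, (count for _, count in group))
        for page, group in grouped
    )
    # Keep only pages that meet the minimum support threshold.
    return {page: n for page, n in totals.items() if n >= min_support}


# Hypothetical sessions: each list holds the pages one visitor requested.
sessions = [
    ["/index", "/news", "/about"],
    ["/index", "/news"],
    ["/index", "/contact"],
    ["/news", "/index", "/news"],
]
result = frequent_pages(sessions, min_support=2)
print(result)  # {'/index': 4, '/news': 3}
```

Because each session is mapped independently and each page is reduced independently, both steps can in principle be spread across nodes, which is the property a MapReduce formulation relies on. The later FP-tree construction and mining stages are not shown here.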

CLC Number: