Abstract: When the data volume is massive, Web data mining on a single node runs into bottlenecks in both time and space efficiency. To address this problem, this paper proposes a parallel FP-growth algorithm for Web log mining on the Hadoop platform, which processes log files using the Hadoop Distributed File System (HDFS) and the MapReduce parallel computing model. Experimental results show that the speedup of the algorithm increases as the dataset grows, and that it executes more efficiently than the serial FP-growth algorithm.
Key words:
Hadoop framework,
Web mining,
Web log,
MapReduce programming model,
Hadoop Distributed File System (HDFS),
parallel FP-growth algorithm
CLC number:
ZHOU Shi-Hui, YIN Jian. Parallel Web Log Mining Algorithm in Hadoop Platform[J]. Computer Engineering, 2013, 39(6): 43-46.