Abstract: Traditional data mining algorithms lack the computing capacity to handle massive data sets. To address this problem, a distributed sequential pattern mining algorithm based on the MapReduce programming model, named MR-PrefixSpan, is proposed. Building on the PrefixSpan algorithm, it partitions the mining task by prefix: Map functions mine the sequential patterns derived from different prefixes and construct the projected databases in parallel, which improves mining efficiency and simplifies the search space. A Reduce function then merges the intermediate values into a possibly smaller set, yielding the global sequential patterns. Experimental results on a Hadoop cluster show that MR-PrefixSpan reduces database scanning time and achieves a high parallel speedup ratio and good scalability.
Key words:
cloud computing,
parallel processing,
MapReduce model,
PrefixSpan algorithm,
sequential pattern,
Hadoop platform
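The prefix-based task split described in the abstract can be illustrated with a minimal single-process sketch. It is not the authors' implementation: the function names (`map_task`, `reduce_task`, `mr_prefixspan`) are hypothetical, each sequence is assumed to be a flat tuple of single items rather than itemsets, and the Map and Reduce phases are simulated with ordinary Python loops instead of Hadoop jobs.

```python
from collections import defaultdict


def frequent_items(db, min_support):
    """Count, per item, the number of sequences containing it; keep frequent ones."""
    counts = defaultdict(int)
    for seq in db:
        for item in set(seq):
            counts[item] += 1
    return {item: c for item, c in counts.items() if c >= min_support}


def project(db, item):
    """Build the projected database: the suffix after the first occurrence of item."""
    return [seq[seq.index(item) + 1:] for seq in db if item in seq]


def prefixspan(prefix, projected_db, min_support):
    """Recursively grow `prefix` with every item frequent in its projected database."""
    for item, support in sorted(frequent_items(projected_db, min_support).items()):
        pattern = prefix + (item,)
        yield pattern, support
        yield from prefixspan(pattern, project(projected_db, item), min_support)


def map_task(item, support, db, min_support):
    """Simulated Map task: mine every pattern that starts with one frequent item."""
    prefix = (item,)
    yield prefix, support
    yield from prefixspan(prefix, project(db, item), min_support)


def reduce_task(intermediate):
    """Simulated Reduce task: merge the per-prefix outputs into one global result.

    Each pattern is emitted by exactly one prefix task, so merging is a union.
    """
    merged = {}
    for pattern, support in intermediate:
        merged[pattern] = support
    return merged


def mr_prefixspan(db, min_support):
    """Run one simulated Map task per frequent length-1 prefix, then reduce."""
    intermediate = []
    for item, support in frequent_items(db, min_support).items():
        # On a cluster these tasks would run in parallel; here they run in turn.
        intermediate.extend(map_task(item, support, db, min_support))
    return reduce_task(intermediate)


# Toy sequence database, made up for illustration only.
example_db = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
patterns = mr_prefixspan(example_db, min_support=2)
for pattern, support in sorted(patterns.items()):
    print(pattern, support)
```

Because each Map task only touches the projected database of its own prefix, the tasks share no state, which is what makes the split by prefix a natural unit of parallel work in this sketch.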
CLC number:
LIU Dong, WEI Yong-Qing, XUE Wen-Juan. Sequential Pattern Mining Algorithm Based on Map Reduce[J]. Computer Engineering, 2012, 38(15): 43-45.