
Computer Engineering ›› 2012, Vol. 38 ›› Issue (15): 43-45. doi: 10.3969/j.issn.1000-3428.2012.15.012

• Software Technology and Database •

Sequential Pattern Mining Algorithm Based on Map Reduce

LIU Dong 1,2, WEI Yong-qing 3, XUE Wen-juan 1,2

  1. School of Information Science and Engineering, Shandong Normal University, Jinan 250014, China
  2. Shandong Provincial Key Laboratory for Distributed Computer Software Novel Technology, Jinan 250014, China
  3. Basic Education Department, Shandong Police College, Jinan 250014, China
  • Received: 2011-10-11 Online: 2012-08-05 Published: 2012-08-05
  • About the authors: LIU Dong (born 1987), male, master's student, CCF member, main research interests: network information security and cloud computing; WEI Yong-qing, professor; XUE Wen-juan, master's student
  • Supported by:
    National Natural Science Foundation of China (60873247); Natural Science Foundation of Shandong Province (ZR2009GZ007)

Abstract: Traditional data mining algorithms lack the computing power to handle massive data sets. To address this problem, a distributed sequential pattern mining algorithm based on the Map Reduce programming model, named MR-PrefixSpan, is proposed. Building on the PrefixSpan algorithm, the mining task is partitioned: the Map function mines the sequential patterns derived from different prefixes and constructs the projected databases in parallel, which improves mining efficiency and simplifies the search space. The Reduce function then reduces the intermediate results to obtain the global sequential patterns. Experimental results on a Hadoop cluster show that MR-PrefixSpan reduces database scanning time and achieves a high parallel speedup ratio and good scalability.

Key words: cloud computing, parallel processing, Map Reduce model, PrefixSpan algorithm, sequential pattern, Hadoop platform
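
The partitioning idea in the abstract can be illustrated with a small single-machine sketch. It is not the authors' implementation and does not use Hadoop: the function names (`map_task`, `reduce_task`, `mr_prefixspan`) are hypothetical, each sequence element is assumed to be a single item rather than an itemset, and the map tasks run in an ordinary loop where a cluster would run them in parallel.

```python
from collections import Counter


def count_support(db):
    """Count, for each item, how many sequences contain it at least once."""
    return Counter(item for seq in db for item in set(seq))


def project(db, item):
    """Projected database for `item`: the suffix of each sequence after the
    first occurrence of `item`. Empty suffixes are dropped."""
    projected = []
    for seq in db:
        if item in seq:
            suffix = seq[seq.index(item) + 1:]
            if suffix:
                projected.append(suffix)
    return projected


def grow(db, prefix, min_support):
    """Recursively extend `prefix` using its projected database `db`."""
    patterns = []
    counts = count_support(db)
    for item in sorted(counts):
        if counts[item] >= min_support:
            pattern = prefix + (item,)
            patterns.append((pattern, counts[item]))
            patterns.extend(grow(project(db, item), pattern, min_support))
    return patterns


def map_task(prefix_item, support, db, min_support):
    """Hypothetical map step: mine every pattern that starts with one
    frequent length-1 prefix, independently of the other prefixes."""
    first = [((prefix_item,), support)]
    return first + grow(project(db, prefix_item), (prefix_item,), min_support)


def reduce_task(partial_results):
    """Hypothetical reduce step: merge the per-prefix outputs into one
    global mapping from pattern to support."""
    merged = {}
    for patterns in partial_results:
        for pattern, support in patterns:
            merged[pattern] = support
    return merged


def mr_prefixspan(db, min_support):
    """Split the mining task by frequent length-1 prefix, then merge."""
    counts = count_support(db)
    prefixes = [item for item in sorted(counts) if counts[item] >= min_support]
    # On a cluster, each call below would be an independent map task.
    partial = [map_task(p, counts[p], db, min_support) for p in prefixes]
    return reduce_task(partial)


# Toy sequence database (illustrative data, not from the paper).
db = [("a", "b", "c"), ("a", "c"), ("a", "b"), ("b", "c")]
patterns = mr_prefixspan(db, min_support=2)
```

Because each prefix owns a disjoint part of the search space, the map tasks share no state and the merge step never has to reconcile conflicting supports; this is the property that makes the prefix-based split a natural fit for a map/reduce style of execution.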
