作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2022, Vol. 48 ›› Issue (3): 81-89. doi: 10.19678/j.issn.1000-3428.0060253

• 人工智能与模式识别 • 上一篇    下一篇

基于MapReduce的高维数据频繁项集挖掘

赵欣灿1, 朱云1, 毛伊敏2   

  1. 1. 江西理工大学 理学院, 江西赣州 341000;
    2. 江西理工大学 信息工程学院, 江西赣州 341000
  • 收稿日期:2020-12-10 修回日期:2021-03-09 发布日期:2021-03-15
  • 作者简介:赵欣灿(1996-),女,硕士研究生,主研方向为大数据、数据挖掘;朱云,副教授、博士;毛伊敏,教授、博士。
  • 基金资助:
    国家重点研发计划(2018YFC1504705);国家自然科学基金(41562019);江西省教育厅科技项目(GJJ151528,GJJ151531)。

Frequent Itemset Mining of High-Dimensional Data Based on MapReduce

ZHAO Xincan1, ZHU Yun1, MAO Yimin2   

  1. 1. School of Science, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China;
    2. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Received:2020-12-10 Revised:2021-03-09 Published:2021-03-15

摘要: 传统的数据挖掘算法在面向大规模高维数据的挖掘过程中,存在数据特征捕捉准确率低、节点负载不均衡、数据交互频繁、频繁项集紧凑化程度低等问题。提出基于MapReduce的并行挖掘算法PARDG-MR,结合高维数据特征,设计基于维度粒化算法和负载均衡算法的DGPL策略,并对数据进行预处理,以解决高维复杂数据特征属性捕捉困难及数据划分中节点负载不均衡的问题。通过构建基于PJPFP-Tree树的频繁项集并行挖掘策略PARM,实现频繁项集的并行化分组过程,从而提高数据处理的运行效率。在此基础上,提出基于剪枝前缀推论的整合节点剪枝算法PJPFP,提高频繁项集挖掘过程中的剪枝效率,增强频繁项集的紧凑化程度。在Webdocs、NDC、Gisette 3个数据集上的实验结果表明,相比PFP-growth、PWARM、MRPrePost算法,该算法的运行时间平均缩短了约20%,能够有效提高数据挖掘效率且降低内存空间。

关键词: 高维数据, 频繁项集, 维度粒化, 并行化, 候选剪枝策略

Abstract: In the mining process of large-scale high-dimensional data, the traditional data mining algorithm has some problem, such as low accuracy of data feature capture, unbalanced node load, frequent data interaction, and low compactness of frequent itemset.Therefore, this paper proposes a parallel mining algorithm, PARDG-MR which is based on MapReduce.By combining the characteristics of high-dimensional data, a DGPL strategy based on the dimensional granulation algorithm and load balancing algorithm are designed.Data are preprocessed to solve the problems of difficult feature attribute capture of high-dimensional complex data and unbalanced node load in the data division.The parallel grouping process of frequent itemset is realized by constructing a parallel mining strategy PARM of frequent itemset based on the PJPFP-Tree to improve the operation efficiency of data processing.On this basis, it proposes an integrated node pruning algorithm PJPFP based on the pruning prefix inference, which improves the pruning efficiency in the process of frequent itemset mining to enhance the compactness of frequent itemset and improve the overall mining efficiency of the algorithm.The experimental results on Webdocs, NDC, and Gisette data sets show that compared with PFP-growth, PWARM and MRPrePost algorithms, the running time of PARDG-MR is shorter by approximately 20% on average, therefore, it is more effective and efficient in data mining.

Key words: high-dimensional data, frequent itemset, dimensional granulation, parallel, candidate pruning strategy

中图分类号: