Author Login Editor-in-Chief Peer Review Editor Work Office Work

Computer Engineering

Previous Articles     Next Articles

Maximal Frequent Itemset Mining Algorithm Based on Nodeset

LIN Chen,GU Junzhong   

  1. (Department of Computer Science and Technology,East China Normal University,Shanghai 200241,China)
  • Received:2015-11-13 Online:2016-12-15 Published:2016-12-15

基于Nodeset的最大频繁项集挖掘算法

林晨,顾君忠   

  1. (华东师范大学 计算机科学技术系,上海 200241)
  • 作者简介:林晨(1990—),男,硕士研究生,主研方向为数据挖掘、情景感知计算;顾君忠,教授、博士生导师。
  • 基金资助:
    上海市国际科技合作项目(13430710100);上海市科委科技创新行动计划项目(13511506201)。

Abstract: The major performance bottlenecks of most maximal frequent itemset mining algorithms based on FP-Tree are caused by recursively traversing and constructing conditional FP-Trees and superset check.Therefore,this paper proposes a maximal frequent itemset mining algorithm based on Nodeset named MFIN.Instead of recursively traversing and constructing conditional FP-Trees,this algorithm adopts a novel data structure called Nodeset to encode the nodes of POC-Tree and uses a set-enumeration tree to represent search space.Besides,this paper proposes an early-stopping method to minimize the cost of intersection operation of Nodesets and adopts parent equivalence pruning technique and look-ahead pruning technique to reduce the search space.The efficiency of superset check method is promoted by improving the projection strategy based on MFI-Tree.Experimental results show that the performance of MFIN algorithm is superior to the classic FP-Max algorithm which is based on FP-Tree in running time and efficiency on mushroom,pumsb and webdocs datasets.

Key words: maximal frequent itemset, association rule, pruning technique, prefix tree, superset check

摘要: 递归遍历、条件FP-Tree构建与超集检测是多数基于FP-Tree最大频繁项集挖掘算法的主要性能瓶颈。为此,提出一种基于Nodeset的最大频繁项集挖掘算法——MFIN算法。该算法采用Nodeset数据结构对POC-Tree的节点编码,将集合枚举树作为搜索空间,避免递归遍历和条件FP-Tree构建的时间开销。设计提前停止方法提高求解Nodeset交集的效率,采用父等价剪枝技术和前瞻剪枝技术缩小搜索空间。对基于MFI-Tree的投影策略进行改进,提升超集检测的速度。实验结果表明,MFIN算法在mushroom,pumsb,webdocs数据集上的运行时间及执行效率等总体性能明显优于基于FP-Tree的FP-Max算法。

关键词: 最大频繁项集, 关联规则, 剪枝技术, 前缀树, 超集检测

CLC Number: