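Abstract: Uncertain data mining is an active research topic in data mining, yet few algorithms address maximal frequent itemsets in this setting. Based on the characteristics of uncertain data mining, this paper extends GenMax, an algorithm for mining maximal frequent patterns in deterministic data, to uncertain data and proposes the U-GenMax algorithm. The Tid set is extended by adding a probability field to the id field, which enables conversion to the vertical data format. For frequent-itemset testing, prior checks are added to prune infrequent itemsets, which reduces computation compared with calculating the confidence directly. A new multi-step rollback pruning strategy based on a stack structure is presented, which avoids the limitation that GenMax can roll back only one step at a time. Experimental results show that the algorithm performs well and is suitable for sparse datasets in all cases and for dense datasets when the support threshold is high.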
Keywords:
uncertain data,
frequent itemset,
maximal pattern,
vertical format,
pruning strategy,
confidence
Abstract: Uncertain data mining has recently become an active research topic in data mining. However, few algorithms can mine maximal frequent itemsets from uncertain data. Based on the features of uncertain data, this paper proposes U-GenMax, an algorithm that extends the maximal pattern mining algorithm GenMax from deterministic data to uncertain data. The algorithm extends the Tid set by adding a probability field to the id field, which enables conversion to the vertical data format. When testing whether an itemset is frequent, the algorithm applies two prior judgments to prune infrequent itemsets, which substantially reduces computation compared with calculating the confidence level directly. The algorithm also introduces a multistep rollback pruning strategy, which avoids the limitation that GenMax rolls back only one step at a time. Experimental results show that U-GenMax performs well and is suitable for sparse databases in all cases and for dense databases when the support threshold is high.
Key words:
uncertain data,
frequent itemset,
maximal pattern,
vertical format,
pruning strategy,
confidence
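To make the extended Tid set concrete, the following is a minimal sketch, not the paper's implementation. It assumes that each transaction maps items to existence probabilities, that items occur independently, and that an itemset counts as frequent when the probability of its support reaching `minsup` meets a confidence threshold. The function names `to_vertical`, `intersect`, and `frequentness` are hypothetical, and the single length-based pre-check only illustrates the idea of a prior judgment; the two prior judgments used by U-GenMax are not specified in the abstract.

```python
from collections import defaultdict


def to_vertical(transactions):
    """Convert horizontal uncertain data to vertical format.

    Each item maps to an extended Tid list of (tid, probability) pairs.
    """
    vertical = defaultdict(list)
    for tid, items in enumerate(transactions):
        for item, prob in items.items():
            vertical[item].append((tid, prob))
    return dict(vertical)


def intersect(a, b):
    """Join two extended Tid lists on tid.

    Probabilities are multiplied, which assumes independent items.
    """
    probs_b = dict(b)
    return [(tid, p * probs_b[tid]) for tid, p in a if tid in probs_b]


def frequentness(tidlist, minsup):
    """Return P(support >= minsup) for an extended Tid list."""
    if minsup <= 0:
        return 1.0
    # Illustrative pre-check (an assumption): fewer candidate transactions
    # than minsup means the itemset can never reach the threshold.
    if len(tidlist) < minsup:
        return 0.0
    # dist[k] = P(exactly k occurrences so far) for k < minsup;
    # dist[minsup] accumulates P(at least minsup occurrences).
    dist = [1.0] + [0.0] * minsup
    for _, p in tidlist:
        dist[minsup] += dist[minsup - 1] * p
        for k in range(minsup - 1, 0, -1):
            dist[k] = dist[k] * (1 - p) + dist[k - 1] * p
        dist[0] *= 1 - p
    return dist[minsup]


transactions = [
    {"a": 0.9, "b": 0.8},
    {"a": 0.5},
    {"a": 1.0, "b": 0.5},
]
vertical = to_vertical(transactions)
ab = intersect(vertical["a"], vertical["b"])
```

In this sketch, `frequentness(ab, 2)` would be compared against a user-chosen confidence threshold. The pre-check returns early without running the dynamic program, which is the kind of saving the abstract attributes to pruning before the confidence is computed.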
CLC number: