作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程

• 人工智能及识别技术 • 上一篇    下一篇

一种挖掘不确定数据最大模式的深度优先算法

李雨明 1,邱卫东 1,徐赛赛 2,郭英凯 3   

  1. (1.上海交通大学信息安全工程学院,上海 200240; 2.上海市公安局网络安全保卫大队,上海 200000;3.北京北信源软件股份有限公司,北京 100081)
  • 收稿日期:2014-07-14 出版日期:2015-07-15 发布日期:2015-07-15
  • 作者简介:李雨明(1988-),男,硕士研究生,主研方向:数据挖掘,分布式系统;邱卫东,教授;徐赛赛、郭英凯,工程师。
  • 基金资助:
    国家科技部科技支撑计划基金资助项目(2011BAK13B05);教育部新世纪优秀人才计划基金资助项目(NCET-12-0358);上海市科委科研创新基金资助重点项目(12ZZ019);上海市科技计划基金资助项目(13JG0500400)。

A Depth-first Algorithm for Mining Uncertain Data Maximal Model

LI Yuming 1,QIU Weidong 1,XU Saisai 2,GUO Yingkai 3   

  1. (1.School of Information Security and Engineering,Shanghai Jiaotong University,Shanghai 200240,China;2.Network Security Defense Brigade of Shanghai Public Security Bureau,Shanghai 200000,China;3.Beijing VRV Software Corporation Limited,Beijing 100081,China)
  • Received:2014-07-14 Online:2015-07-15 Published:2015-07-15

摘要: 不确定性数据挖掘是数据挖掘领域的研究热点,但其应用于最大频繁项集的算法较少。根据不确定数据挖掘的特点,把挖掘确定性数据最大频繁模式的GenMax算法扩展到不确定数据中,提出一种U-GenMax算法。对Tid集进行扩展,在id域的基础上增加概率域,实现垂直数据格式转换。在频繁项集判断方面加入前置判断来剪枝非频繁项集,相比直接计算置信度的方式,降低了计算量。基于栈式结构给出多步回退剪枝新策略,从而避免GenMax算法只能单步回退的缺陷。实验结果证明,该算法计算性能良好,可适用于各种情况下的稀疏数据集与支持度较高情况下的稠密数据集。

关键词: 不确定数据, 频繁项集, 最大模式, 垂直格式, 剪枝策略, 置信度

Abstract: The research on uncertain data mining becomes a hotspot in the area of data mining recently.However,there are few algorithms which can be used to mine maximal frequent itemsets.Based on features of uncertain data,this paper proposes a new U-GenMax algorithm which improves and extends the maximal pattern mining algorithm GenMax from deterministic data to uncertain data.The algorithm extends the Tid set and adds probabilistic domain to the id domain,and realizes format converting of vertical data.In the aspect of judging frequent itemsets,the algorithm adds two prior judgments to prune infrequent itemsets,and lowers the amount of calculation enormously compared with calculating confidence level directly.The algorithm proposes a new multistep rollback pruning strategy,thus avoids the flaw of GenMax which only rolls back one step at a time.Experimental results show that the performance of U-GenMax is very good and suitable for sparse database under all circumstances as well as dense database under high degree of support.

Key words: uncertain data, frequent itemset, maximal pattern, vertical format, pruning strategy, confidence

中图分类号: