计算机工程

• 先进计算与数据处理 • 上一篇    下一篇

基于矩阵的数据流Top-k频繁项集挖掘算法

尹绍宏,范桂丹   

  1. (天津工业大学计算机科学与软件学院,天津 300387)
  • 收稿日期:2013-01-11 出版日期:2014-03-15 发布日期:2014-03-13
  • 作者简介:尹绍宏(1964-),女,副教授,主研方向:数据挖掘;范桂丹,硕士研究生。

Top-k Frequent Itemsets Mining Algorithm over Data Streams Based on Matrix

YIN Shao-hong, FAN Gui-dan   

  1. (School of Computer Science and Software Engineering, Tianjin Polytechnic University, Tianjin 300387, China)
  • Received:2013-01-11 Online:2014-03-15 Published:2014-03-13

摘要: 传统的数据挖掘算法在挖掘频繁项集时会产生大量的冗余项集,影响挖掘效率。为此,提出一种基于矩阵的数据流Top-k频繁项集挖掘算法。引入2个0-1矩阵,即事务矩阵和二项集矩阵。采用事务矩阵表示滑动窗口模型中的事务列表,通过计算每行的支持度得到二项集矩阵。利用二项集矩阵得到候选项集,将事务矩阵中对应的行做逻辑与运算,计算出候选项集的支持度,从而得到Top-k频繁项集。把挖掘的结果存入数据字典中,当用户查询时,能够按支持度降序输出Top-k频繁项集。实验结果表明,该算法在挖掘过程中能避免冗余项集的产生,在保证正确率的前提下具有较高的时间效率。

关键词: 数据挖掘, 数据流, 滑动窗口, 矩阵, Top-k频繁项集

Abstract: The past algorithms produce large amounts of redundant itemsets, and they affect the efficiency of data mining. Therefore, a Top-k frequent itemsets mining algorithm over data streams based on matrix is proposed. Two 0-1 matrices, transaction matrix and 2-itemsets matrix, are introduced into the algorithm. Using transaction matrix to express the transaction list of a sliding window, and 2-itemsets matrix is obtained by calculating the support of each row. Then it can get candidate items by 2-itemsets matrix, and Top-k frequent itemsets are obtained by calculating the support of candidate items through logic and operation of correspond row in transaction matrix. Finally it saves the result of data mining into data dictionary. The algorithm can output the Top-k frequent itemsets by support in descendant order when user queries. Experimental results show that the algorithm avoids redundant itemsets in the process of data mining, and the efficiency of data mining is improved appreciably under the premise of accuracy.

Key words: data mining, data stream, sliding window, matrix, Top-k frequent itemset

中图分类号: