计算机工程 ›› 2020, Vol. 46 ›› Issue (11): 109-116.doi: 10.19678/j.issn.1000-3428.0056714

• 先进计算与数据处理 • 上一篇    下一篇

基于哈希存储与事务加权的并行Apriori改进算法

李洁1,2, 朱洪亮1,2, 陈玉玲2, 辛阳1,2   

  1. 1. 北京邮电大学 网络空间安全学院, 北京 100876;
    2. 贵州大学 贵州省公共大数据重点实验室, 贵阳 550025
  • 收稿日期:2019-11-26 修回日期:2020-01-04 发布日期:2020-01-10
  • 作者简介:李洁(1993-),女,硕士研究生,主研方向为网络安全、数据挖掘;朱洪亮,讲师、博士;陈玉玲,副教授;辛阳,教授。
  • 基金项目:
    国家重点研发计划(2017YFB0802300);贵州省科技重大专项(20183001);贵州省公共大数据重点实验室开放课题(2018BDKFJJ008,2018BDKFJJ020)。

Improved Parallel Apriori Algorithm Based on Hash Storage and Transaction Weighting

LI Jie1,2, ZHU Hongliang1,2, CHEN Yuling2, XIN Yang1,2   

  1. 1. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China;
    2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China
  • Received:2019-11-26 Revised:2020-01-04 Published:2020-01-10

摘要: Apriori算法能够挖掘事物之间的关联关系,但传统Apriori算法每计算一次候选集的支持度,都需要遍历原始事务数据库,多次扫描数据库导致其效率较低。为此,提出一种基于哈希存储与事务加权的改进算法。通过哈希存储的去重特性对事务进行去重,以减少冗余计算。将项目与项集的映射存储到哈希结构中,避免计算候选集的支持度时多次扫描事务数据库。同时开启多个线程,并行计算候选集的支持度,从而提高Apriori算法的运行效率。在开源数据集上的实验结果表明,当数据集中事务条数以及重复事务数越多时,该算法相较于传统Apriori算法的性能提升越明显,其运行时间与FP-Growth算法相近但避免了FP-Growth算法内存占用过大的问题。

关键词: 关联规则, 频繁项集, 哈希存储, 事务加权, 并行计算

Abstract: The Apriori algorithm can mine the association relationships between things,but the traditional Apriori algorithm needs to traverse the original transaction database every time the support of the candidate set is calculated,which reduces the efficiency of the algorithm.To address the problem,this paper proposes an improved algorithm based on hash storage and transaction weighting.The algorithm uses the deduplication feature of hash storage to deduplicate the transactions to reduce redundant calculations.At the same time,the mapping between the transaction set and the itemset is stored in the hash structure to avoid scanning the transaction database for multiple times during the calculation of the support of the candidate set.In addition,the support of the candidate set is calculated in parallel using multiple threads to improve the efficiency of the Apriori algorithm.Experimental results on open-source datasets show that the performance improvement of the proposed algorithm over the traditional Apriori algorithm increases with the number of transactions and repeated transactions in the dataset.Its running time is similar to that of the FP-Growth algorithm while the excessive memory consumption is avoided.

Key words: association rule, frequent itemset, hash storage, transaction weighting, parallel computing

中图分类号: