Improved Parallel Apriori Algorithm Based on Hash Storage and Transaction Weighting

Computer Engineering, 2020, Vol. 46, Issue (11): 109-116. doi: 10.19678/j.issn.1000-3428.0056714

LI Jie^1,2, ZHU Hongliang^1,2, CHEN Yuling^2, XIN Yang^1,2

1. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China;
2. Guizhou Provincial Key Laboratory of Public Big Data, Guizhou University, Guiyang 550025, China

Received: 2019-11-26; Revised: 2020-01-04; Published: 2020-01-10
About the authors: LI Jie (born 1993), female, master's student, whose main research interests are network security and data mining; ZHU Hongliang, lecturer, Ph.D.; CHEN Yuling, associate professor; XIN Yang, professor.
Funding: National Key Research and Development Program of China (2017YFB0802300); Major Science and Technology Special Project of Guizhou Province (20183001); Open Project of the Guizhou Provincial Key Laboratory of Public Big Data (2018BDKFJJ008, 2018BDKFJJ020).

Abstract: The Apriori algorithm can mine association relationships between things, but the traditional Apriori algorithm must traverse the original transaction database every time it computes the support of a candidate set, and these repeated scans make it inefficient. To address this problem, this paper proposes an improved algorithm based on hash storage and transaction weighting. The algorithm uses the deduplicating property of hash storage to remove duplicate transactions and so reduce redundant computation. The mapping between items and itemsets is stored in a hash structure, which avoids scanning the transaction database repeatedly when the support of candidate sets is computed. In addition, multiple threads compute the support of candidate sets in parallel to improve the running efficiency of the Apriori algorithm. Experimental results on open-source datasets show that the more transactions and duplicate transactions a dataset contains, the greater the performance improvement of the proposed algorithm over the traditional Apriori algorithm. Its running time is close to that of the FP-Growth algorithm, while it avoids the excessive memory consumption of FP-Growth.

Key words: association rule, frequent itemset, hash storage, transaction weighting, parallel computing
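
The three ideas named in the abstract (deduplicating transactions with weights, a hash-based index that replaces database scans, and parallel support counting) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the use of absolute support counts, and the choice to index each item by the distinct transactions that contain it are all assumptions made for illustration, since the abstract does not specify these details.

```python
from collections import Counter, defaultdict
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def dedupe_with_weights(transactions):
    """Collapse identical transactions; the weight is how often each occurred."""
    return Counter(frozenset(t) for t in transactions)


def build_item_index(weighted):
    """Hash map from each item to the distinct transactions containing it (assumed layout)."""
    index = defaultdict(set)
    for txn in weighted:
        for item in txn:
            index[item].add(txn)
    return index


def support(candidate, index, weighted):
    """Weighted support: intersect the per-item transaction sets, then sum the weights."""
    txn_sets = [index.get(item, set()) for item in candidate]
    common = set.intersection(*txn_sets)
    return sum(weighted[txn] for txn in common)


def frequent_itemsets(transactions, min_support, workers=4):
    """Level-wise Apriori over the deduplicated data; min_support is an absolute count."""
    weighted = dedupe_with_weights(transactions)
    index = build_item_index(weighted)
    result = {}
    candidates = [frozenset([item]) for item in index]
    k = 1
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while candidates:
            # Support of each candidate is computed by the thread pool,
            # using only the in-memory index (no rescan of the raw data).
            counts = pool.map(lambda c: support(c, index, weighted), candidates)
            level = {c: n for c, n in zip(candidates, counts) if n >= min_support}
            result.update(level)
            k += 1
            # Join step: unions of two frequent (k-1)-itemsets that have exactly k items.
            joined = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
            # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
            candidates = [
                c for c in joined
                if all(frozenset(s) in level for s in combinations(c, k - 1))
            ]
    return result
```

Calling `frequent_itemsets(transactions, min_support)` with a list of item lists returns a dictionary from each frequent itemset to its weighted support. Under CPython the global interpreter lock limits the speedup that threads give to this kind of CPU-bound counting, so the sketch shows the structure of the parallel step rather than its performance.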

CLC number: TP391

Cite this article: LI Jie, ZHU Hongliang, CHEN Yuling, XIN Yang. Improved Parallel Apriori Algorithm Based on Hash Storage and Transaction Weighting[J]. Computer Engineering, 2020, 46(11): 109-116.

https://www.ecice06.com/CN/Y2020/V46/I11/109

Figures/Tables: 8
