Independent Exact Permutation Testing Algorithm for Distinguishing Sequential Pattern Discovery

doi:10.19678/j.issn.1000-3428.0058601

Abstract

Traditional distinguishing sequential pattern mining algorithms return a number of false-positive patterns, and the misleading information these patterns carry can interfere with decisions in downstream tasks. To filter them out, this paper proposes an algorithm named IEP-DSP. It uses the SPADE algorithm and the WRAcc contrast measure to find the candidate distinguishing sequential patterns as well as the distinguishing sequential patterns in all permuted data sets. By simulating the permutation process, the independent exact permutation testing method builds an independent exact null distribution for each pattern length and computes an exact p-value for every candidate pattern. The False Discovery Rate (FDR) measure is then used to control the number of false-positive patterns of each length at significance level α. Experimental results on real and simulated data sets show that IEP-DSP filters out a large number of false-positive distinguishing sequential patterns while retaining more true distinguishing sequential patterns than methods based on statistical significance testing, which demonstrates the advantage of independent exact permutation testing over standard permutation testing.

Key words: data mining, pattern discovery, distinguishing sequential pattern mining, statistical significance testing, independent exact permutation testing
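The pipeline outlined in the abstract (score candidates with WRAcc, build one null distribution per pattern length from permuted data, then control the FDR per length) can be illustrated with a minimal Python sketch. This is not the authors' IEP-DSP implementation. It assumes gapped subsequence matching, a two-class labelling, a Monte Carlo label-shuffling approximation in place of the exact p-value computation, and the Benjamini-Hochberg procedure as the FDR control. All function names and parameters (`wracc`, `length_wise_pvalues`, `n_perm`, and so on) are hypothetical.

```python
import random
from collections import defaultdict


def is_subsequence(pattern, sequence):
    """True if `pattern` occurs in `sequence` in order, gaps allowed (assumption)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def wracc(pattern, sequences, labels, target=1):
    """Weighted relative accuracy: P(cover) * (P(target | cover) - P(target))."""
    n = len(sequences)
    covered = [lab for seq, lab in zip(sequences, labels)
               if is_subsequence(pattern, seq)]
    if not covered:
        return 0.0
    p_cover = len(covered) / n
    p_target = sum(1 for lab in labels if lab == target) / n
    p_target_given_cover = sum(1 for lab in covered if lab == target) / len(covered)
    return p_cover * (p_target_given_cover - p_target)


def length_wise_pvalues(candidates, sequences, labels, n_perm=200, seed=0):
    """Monte Carlo p-values using a separate null distribution per pattern length.

    Assumption: the null is built by shuffling class labels and re-scoring the
    same candidates; the paper mines the permuted data sets and computes exact
    p-values instead.
    """
    rng = random.Random(seed)
    observed = {p: wracc(p, sequences, labels) for p in candidates}
    null = defaultdict(list)  # pattern length -> WRAcc scores under permutation
    shuffled = list(labels)
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        for p in candidates:
            null[len(p)].append(wracc(p, sequences, shuffled))
    pvals = {}
    for p, score in observed.items():
        dist = null[len(p)]
        exceed = sum(1 for s in dist if s >= score)
        pvals[p] = (exceed + 1) / (len(dist) + 1)  # add-one keeps p > 0
    return pvals


def benjamini_hochberg(pvalues, alpha):
    """Indices of hypotheses rejected by the BH step-up procedure at level alpha."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= rank * alpha / m:
            k = rank
    return set(order[:k])


def select_significant(pvals, alpha=0.05):
    """Apply BH separately to the patterns of each length."""
    by_length = defaultdict(list)
    for pattern in pvals:
        by_length[len(pattern)].append(pattern)
    kept = set()
    for patterns in by_length.values():
        rejected = benjamini_hochberg([pvals[p] for p in patterns], alpha)
        kept.update(patterns[i] for i in rejected)
    return kept
```

Patterns are passed as tuples so they can serve as dictionary keys, for example `select_significant(length_wise_pvalues([("a",), ("a", "b")], sequences, labels))`. Keeping a separate null distribution per length prevents the scores of short, frequent patterns from dominating the reference distribution used for longer ones, which is the motivation the abstract gives for length-specific testing.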

CLC number: TP391

Cite this article: WU Jun, OUYANG Aijia, ZHANG Lin. Independent Exact Permutation Testing Algorithm for Distinguishing Sequential Pattern Discovery[J]. Computer Engineering, 2021, 47(8): 45-53, 61.

https://www.ecice06.com/CN/Y2021/V47/I8/45

Figures/Tables: 10

