
Computer Engineering (计算机工程) ›› 2024, Vol. 50 ›› Issue (1): 156-165. doi: 10.19678/j.issn.1000-3428.0066519

• Cyberspace Security •

  • Supported by: Shandong Provincial Natural Science Foundation General Program (ZR2020MF048)

Highly Efficient Parameter-Pruning Algorithm of Decision Tree for Large Datasets

Zhaoxian XIE, Xingmin ZOU*(), Wenjing ZHANG   

  1. School of Cyber Science and Engineering, Qufu Normal University, Qufu 273165, Shandong, China
  • Received:2022-12-13 Online:2024-01-15 Published:2024-01-11
  • Contact: Xingmin ZOU


Abstract:

Decision Trees (DTs) perform well on data classification but are prone to overfitting. The usual remedy is to prune the tree; however, traditional pruning algorithms have shortcomings: prepruning is prone to underfitting, postpruning is time-consuming, and grid-search pruning is suitable only for small datasets. This study proposes an efficient parameter-pruning algorithm for DTs to address these problems. Based on the network security situation awareness model, the architecture of a pruned-decision-tree situation awareness system is established, and the network data flow is analyzed. While the DT is being generated, enumeration combined with binary search is used to determine the tree's maximum depth, and a depth-first search is used to determine the minimum number of samples per split and the maximum number of features. Finally, the three optimal parameters are combined to complete the pruning from the top down. The experimental results show that the proposed algorithm carries a low risk of overfitting on large datasets: accuracy on both the training and testing sets exceeds 95%, and the algorithm is almost 20 times faster than Pessimistic Error Pruning (PEP), the best-performing postpruning algorithm among the compared postpruning methods.
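The three-parameter idea in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it binary-searches a maximum depth by checking whether deepening the tree still improves validation accuracy, then scans the minimum samples-per-split and the maximum feature count, and finally trains one tree with the three parameters combined. The dataset, search bounds, and the `tol` improvement threshold are all assumptions; scikit-learn's `DecisionTreeClassifier` stands in for the paper's tree learner.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def val_accuracy(Xtr, ytr, Xva, yva, **params):
    """Train one tree with the given parameters and score it on validation data."""
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(Xtr, ytr)
    return clf.score(Xva, yva)

def search_max_depth(Xtr, ytr, Xva, yva, lo=1, hi=40, tol=0.002):
    # Binary search on depth: keep deepening while going one level deeper
    # still improves validation accuracy by more than `tol`.
    while lo < hi:
        mid = (lo + hi) // 2
        gain = (val_accuracy(Xtr, ytr, Xva, yva, max_depth=mid + 1)
                - val_accuracy(Xtr, ytr, Xva, yva, max_depth=mid))
        if gain > tol:
            lo = mid + 1      # still underfitting: go deeper
        else:
            hi = mid          # extra depth no longer helps: cap here
    return lo

# Synthetic stand-in dataset (the paper evaluates on large real datasets).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0)

depth = search_max_depth(Xtr, ytr, Xva, yva)

# Plain scans stand in for the paper's depth-first search over the
# remaining two parameters, reusing the depth found above.
split = max(range(2, 21), key=lambda s: val_accuracy(
    Xtr, ytr, Xva, yva, max_depth=depth, min_samples_split=s))
feats = max(range(1, X.shape[1] + 1), key=lambda f: val_accuracy(
    Xtr, ytr, Xva, yva, max_depth=depth, min_samples_split=split,
    max_features=f))

# Final tree built with the three parameters combined.
pruned = DecisionTreeClassifier(max_depth=depth, min_samples_split=split,
                                max_features=feats, random_state=0)
pruned.fit(Xtr, ytr)
print(depth, split, feats, round(pruned.score(Xva, yva), 3))
```

Because each parameter is fixed while the tree is grown, the tree is trained once per candidate setting rather than grown fully and then cut back, which is where the speed advantage over postpruning methods such as PEP comes from.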

Key words: Decision Tree (DT), pruning, overfitting, security situational awareness, generalization