
Computer Engineering (计算机工程) ›› 2024, Vol. 50 ›› Issue (1): 156-165. doi: 10.19678/j.issn.1000-3428.0066519

• Cyberspace Security •

  • Supported by: Shandong Provincial Natural Science Foundation General Program (ZR2020MF048)

Highly Efficient Parameter-Pruning Algorithm of Decision Tree for Large Datasets

Zhaoxian XIE, Xingmin ZOU*(), Wenjing ZHANG   

  1. School of Cyber Science and Engineering, Qufu Normal University, Qufu 273165, Shandong, China
  • Received:2022-12-13 Online:2024-01-15 Published:2024-01-11
  • Contact: Xingmin ZOU


Abstract:

Decision Trees (DTs) perform well on data classification but are prone to overfitting. The usual remedy is to prune the tree; however, traditional pruning algorithms have shortcomings: prepruning is prone to underfitting, postpruning is time-consuming, and grid-search pruning is suitable only for small datasets. This study proposes an efficient parameter-pruning algorithm for DTs to address these problems. Based on the network security situation awareness model, the architecture of a pruned-decision-tree situation awareness system is established, and the network data flow is analyzed. While the DT is being generated, enumeration combined with binary search is used to determine the tree's maximum depth, and a depth-first search is used to determine the minimum number of samples per split and the maximum number of features. Finally, the three optimal parameters are combined to complete the pruning from the top down. The experimental results show that the proposed algorithm carries a low risk of overfitting on large datasets: accuracy on both the training and testing sets exceeds 95%, and the algorithm is almost 20 times faster than Pessimistic Error Pruning (PEP), the best-performing postpruning algorithm among the compared postpruning methods.
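The three-parameter idea in the abstract can be sketched as follows. This is a hypothetical illustration, not the authors' implementation: it binary-searches a maximum depth by checking whether deepening the tree still improves validation accuracy, then scans the minimum samples-per-split and the maximum feature count, and finally trains one tree with the three parameters combined. The dataset, search bounds, and the `tol` improvement threshold are all assumptions; scikit-learn's `DecisionTreeClassifier` stands in for the paper's tree learner.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def val_accuracy(Xtr, ytr, Xva, yva, **params):
    """Train one tree with the given parameters and score it on validation data."""
    clf = DecisionTreeClassifier(random_state=0, **params)
    clf.fit(Xtr, ytr)
    return clf.score(Xva, yva)

def search_max_depth(Xtr, ytr, Xva, yva, lo=1, hi=40, tol=0.002):
    # Binary search on depth: keep deepening while going one level deeper
    # still improves validation accuracy by more than `tol`.
    while lo < hi:
        mid = (lo + hi) // 2
        gain = (val_accuracy(Xtr, ytr, Xva, yva, max_depth=mid + 1)
                - val_accuracy(Xtr, ytr, Xva, yva, max_depth=mid))
        if gain > tol:
            lo = mid + 1      # still underfitting: go deeper
        else:
            hi = mid          # extra depth no longer helps: cap here
    return lo

# Synthetic stand-in dataset (the paper evaluates on large real datasets).
X, y = make_classification(n_samples=3000, n_features=20, random_state=0)
Xtr, Xva, ytr, yva = train_test_split(X, y, test_size=0.3, random_state=0)

depth = search_max_depth(Xtr, ytr, Xva, yva)

# Plain scans stand in for the paper's depth-first search over the
# remaining two parameters, reusing the depth found above.
split = max(range(2, 21), key=lambda s: val_accuracy(
    Xtr, ytr, Xva, yva, max_depth=depth, min_samples_split=s))
feats = max(range(1, X.shape[1] + 1), key=lambda f: val_accuracy(
    Xtr, ytr, Xva, yva, max_depth=depth, min_samples_split=split,
    max_features=f))

# Final tree built with the three parameters combined.
pruned = DecisionTreeClassifier(max_depth=depth, min_samples_split=split,
                                max_features=feats, random_state=0)
pruned.fit(Xtr, ytr)
print(depth, split, feats, round(pruned.score(Xva, yva), 3))
```

Because each parameter is fixed while the tree is grown, the tree is trained once per candidate setting rather than grown fully and then cut back, which is where the speed advantage over postpruning methods such as PEP comes from.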

Key words: Decision Tree (DT), pruning, overfitting, security situational awareness, generalization