
Computer Engineering ›› 2022, Vol. 48 ›› Issue (3): 1-9. doi: 10.19678/j.issn.1000-3428.0063249

• Hot Topics and Reviews •

Improved Random Forest Algorithm Based on Out-of-Bag Prediction and Extended Space

CHANG Shuo1, ZHANG Yanchun2

  1. School of Computer Science, Fudan University, Shanghai 200082, China;
  2. Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China
  • Received: 2021-11-16 Revised: 2022-01-12 Published: 2021-12-15
  • About the authors: CHANG Shuo (born 1994), male, master's student; his main research interests are health big data and artificial intelligence. ZHANG Yanchun, professor, Ph.D., doctoral supervisor.
  • Funding: National Natural Science Foundation of China (61672161).

Abstract: Random forest builds decision trees by sampling features on top of bootstrap sampling, trading some accuracy of the individual trees for lower correlation among them and thereby improving overall prediction accuracy. However, when the data scale is large, the correlation among the decision trees remains high, and the random forest performs poorly. To address this problem, an improved algorithm based on out-of-bag prediction is proposed that raises the prediction performance of the random forest by increasing the accuracy of its decision trees. The out-of-bag predictions of the random forest are combined with the original features and the random forest is retrained, which reduces the VC-dimension, empirical risk, and generalization risk of the decision trees and improves their accuracy, and in turn the prediction performance of the random forest. More accurate decision trees, however, tend to make similar predictions, which increases the correlation among them and can hurt the final prediction performance of the random forest. Therefore, an extended space algorithm is used to generate different features for different decision trees, reducing the correlation among the trees without significantly reducing their accuracy. Experimental results show that the average accuracy of the proposed algorithm on 32 datasets is 1.7% higher than that of the original random forest, and that under the corrected paired t-test its prediction performance is significantly better than that of the original random forest on 19 of these datasets.

Key words: random forest, out-of-bag prediction, extended space, correlation, decision tree
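The first step described in the abstract, appending out-of-bag predictions to the original features and retraining, can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several choices here are assumptions made only for illustration: one-split decision stumps stand in for full decision trees, the out-of-bag prediction is taken to be the per-class fraction of out-of-bag votes, and the names `fit_stump`, `fit_forest`, and `oob_fractions` are hypothetical. The extended space step is omitted because the abstract does not describe how it generates features.

```python
import random
from collections import Counter

def fit_stump(X, y, feature_ids):
    """Fit a one-split stump (a stand-in for a full tree) on the given features."""
    best = None
    for j in feature_ids:
        for t in sorted({row[j] for row in X}):
            left = [y[i] for i, row in enumerate(X) if row[j] <= t]
            right = [y[i] for i, row in enumerate(X) if row[j] > t]
            if not left or not right:
                continue  # not a real split
            l_lab = Counter(left).most_common(1)[0][0]
            r_lab = Counter(right).most_common(1)[0][0]
            errors = sum(v != l_lab for v in left) + sum(v != r_lab for v in right)
            if best is None or errors < best[0]:
                best = (errors, j, t, l_lab, r_lab)
    if best is None:
        # No valid split: always predict the majority class of the bag.
        majority = Counter(y).most_common(1)[0][0]
        return (feature_ids[0], float("inf"), majority, majority)
    return best[1:]

def predict_stump(stump, row):
    j, t, l_lab, r_lab = stump
    return l_lab if row[j] <= t else r_lab

def fit_forest(X, y, n_trees, rng):
    """Bootstrap rows and sample features per tree; return (stump, in_bag) pairs."""
    n, d = len(X), len(X[0])
    k = max(1, int(d ** 0.5))  # common sqrt(d) feature-subset size (assumption)
    forest = []
    for _ in range(n_trees):
        bag = [rng.randrange(n) for _ in range(n)]
        feats = rng.sample(range(d), k)
        stump = fit_stump([X[i] for i in bag], [y[i] for i in bag], feats)
        forest.append((stump, set(bag)))
    return forest

def oob_fractions(forest, X, classes):
    """For each training row, the share of votes per class from trees that did not see it."""
    fractions = []
    for i, row in enumerate(X):
        votes = Counter(predict_stump(s, row) for s, in_bag in forest if i not in in_bag)
        total = sum(votes.values())
        # If a row was in every bag, fall back to a uniform vector (assumption).
        fractions.append([votes[c] / total if total else 1.0 / len(classes)
                          for c in classes])
    return fractions

rng = random.Random(0)
X = [[0.1, 1.0], [0.2, 0.9], [0.3, 1.1], [0.8, 1.0], [0.9, 0.8], [1.0, 1.2]]
y = [0, 0, 0, 1, 1, 1]
classes = [0, 1]

stage1 = fit_forest(X, y, 50, rng)                  # original forest
oob = oob_fractions(stage1, X, classes)             # out-of-bag predictions
X_aug = [row + frac for row, frac in zip(X, oob)]   # original features + OOB predictions
stage2 = fit_forest(X_aug, y, 50, rng)              # forest retrained on augmented features
```

Using out-of-bag votes rather than in-bag predictions means each appended value comes only from trees that never saw that row during training, which limits label leakage into the second-stage forest. How the appended features are computed for unseen samples is not stated in the abstract and is left out of this sketch.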
