
Computer Engineering ›› 2022, Vol. 48 ›› Issue (7): 168-176, 198. doi: 10.19678/j.issn.1000-3428.0062165

• Cyberspace Security •

Code Smell Detection Driven by Hybrid Feature Selection and Ensemble Learning

AI Chenghao1, GAO Jianhua1, HUANG Zijie2

  1. Department of Computer Science and Technology, Shanghai Normal University, Shanghai 200234, China;
    2. Department of Computer Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  • Received: 2021-07-22 Revised: 2021-09-16 Online: 2022-07-15 Published: 2022-07-12
  • About the authors: AI Chenghao (born 1995), male, master's student, whose research interests are Web security testing and software testing; GAO Jianhua, professor, Ph.D.; HUANG Zijie, Ph.D. candidate.
  • Funding:
    National Natural Science Foundation of China (61672355).

Abstract: A code smell is a characteristic of software that violates basic design principles or coding standards, and code smells in source code raise the cost and difficulty of maintaining it. Among code smell detection methods, machine learning achieves better performance than other approaches. However, training on a large number of features can trigger the "curse of dimensionality", and a single model may generalize poorly. To address both problems, a code smell detection method driven by hybrid feature selection and ensemble learning is proposed. The weights of all features are computed with ReliefF, XGBoost feature importance, and the Pearson correlation coefficient and are then fused; irrelevant features with low fused weights are removed to obtain a feature subset. A two-layer Stacking ensemble learning model is then constructed: the first layer consists of three different tree models as base classifiers, and the second layer uses Logistic Regression (LR) as the meta-classifier. This two-layer structure combines the strengths of diverse models to improve generalization. The feature subset is fed into the Stacking ensemble learning model to perform code smell classification and detection. Experimental results show that the method reduces the feature dimension and that, compared with the best base classifier in the first layer of the Stacking ensemble learning model, it improves F-measure and G-mean by up to 1.46% and 0.87%, respectively.
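The weight-fusion step described in the abstract can be illustrated with a minimal, standard-library-only Python sketch. The abstract does not state how the three weight vectors are fused or how the cut-off is chosen, so the min-max normalization, the unweighted average, and the 0.5 threshold below are assumptions, not the authors' procedure. The feature names (`loc`, `wmc`, `noise`), the toy data, and the ReliefF and XGBoost scores are made-up placeholders; in practice those two score lists would come from a ReliefF implementation and a trained XGBoost model.

```python
from math import sqrt
from statistics import mean


def pearson_abs(xs, ys):
    """Absolute Pearson correlation between one feature column and the labels."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    # A constant column has no defined correlation; treat it as uninformative.
    return abs(cov / (sx * sy)) if sx and sy else 0.0


def min_max(scores):
    """Rescale scores to [0, 1] so that different methods are comparable."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0] * len(scores)
    return [(s - lo) / (hi - lo) for s in scores]


def fuse_weights(*score_lists):
    """Assumed fusion rule: average the min-max-normalized scores per feature."""
    normalized = [min_max(scores) for scores in score_lists]
    return [mean(per_feature) for per_feature in zip(*normalized)]


def select_features(names, fused, threshold):
    """Keep the features whose fused weight reaches the threshold."""
    return [name for name, weight in zip(names, fused) if weight >= threshold]


# Toy data: three hypothetical features measured on six code samples.
feature_names = ["loc", "wmc", "noise"]
columns = {
    "loc": [120, 95, 30, 25, 110, 40],
    "wmc": [30, 22, 8, 5, 28, 9],
    "noise": [3, 7, 5, 3, 5, 7],
}
labels = [1, 1, 0, 0, 1, 0]  # 1 = smelly, 0 = clean

# Placeholder scores standing in for real ReliefF and XGBoost outputs.
relieff_scores = [0.40, 0.35, 0.02]
xgb_scores = [0.55, 0.40, 0.05]
pearson_scores = [pearson_abs(columns[name], labels) for name in feature_names]

fused = fuse_weights(relieff_scores, xgb_scores, pearson_scores)
selected = select_features(feature_names, fused, threshold=0.5)
print(selected)
```

Normalizing before averaging keeps any one method's scale from dominating the fused weight. The selected subset would then be passed to the two-layer Stacking model, which this sketch does not cover.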

Key words: code smell, feature selection, ensemble learning, weight fusion, Stacking model

Chinese Library Classification (CLC) number: