Code Smell Detection Driven by Hybrid Feature Selection and Ensemble Learning

doi:10.19678/j.issn.1000-3428.0062165


Abstract: A code smell is a software characteristic that violates basic design principles or coding standards. Code smells in source code increase the cost and difficulty of maintaining it. Among code smell detection methods, machine learning can achieve better performance than other approaches. However, training on a large number of features may trigger the "curse of dimensionality", and a single model may generalize poorly. To address both problems, a code smell detection method driven by hybrid feature selection and ensemble learning is proposed. Weights for all features are computed with ReliefF, XGBoost feature importance, and the Pearson correlation coefficient, and are then fused; irrelevant features with low fused weights are removed to obtain a feature subset. A two-layer Stacking ensemble learning model is then constructed: the base classifiers in the first layer are three different tree models, and the second layer uses Logistic Regression (LR) as the meta-classifier, so that the ensemble combines the strengths of diverse models to improve generalization. The feature subset is fed into the Stacking model to perform code smell classification and detection. Experimental results show that the proposed method reduces the feature dimension and, compared with the best base classifier in the first layer of the Stacking model, improves F-measure and G-mean by up to 1.46% and 0.87%, respectively.
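The weight-fusion step can be illustrated with a minimal sketch. The abstract does not state how the three sets of weights are combined, so the sketch assumes min-max normalization followed by an unweighted mean. The ReliefF and XGBoost scores are hypothetical hard-coded values rather than outputs of those algorithms, and the metric names (LOC, WMC, DIT, NOC), the sample data, and the 0.3 threshold are likewise illustrative assumptions.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / sqrt(var_x * var_y)


def min_max(values):
    """Rescale scores to [0, 1] so that different criteria are comparable."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def fuse_weights(*weight_lists):
    """Assumed fusion rule: mean of the min-max normalized weights per feature."""
    normalized = [min_max(w) for w in weight_lists]
    return [sum(ws) / len(ws) for ws in zip(*normalized)]


def select_features(names, fused, threshold):
    """Keep the features whose fused weight reaches the threshold."""
    return [name for name, w in zip(names, fused) if w >= threshold]


# Hypothetical toy data: four metrics measured on six classes, label 1 = smelly.
names = ["LOC", "WMC", "DIT", "NOC"]
columns = [
    [900, 750, 820, 120, 200, 150],  # LOC
    [40, 35, 12, 10, 14, 9],         # WMC
    [1, 2, 3, 2, 1, 3],              # DIT
    [0, 1, 0, 1, 0, 1],              # NOC
]
labels = [1, 1, 1, 0, 0, 0]

pearson_w = [abs(pearson(col, labels)) for col in columns]
relieff_w = [0.30, 0.20, 0.02, 0.01]  # placeholder, not computed by ReliefF here
xgboost_w = [0.50, 0.30, 0.15, 0.05]  # placeholder, not computed by XGBoost here

fused = fuse_weights(pearson_w, relieff_w, xgboost_w)
selected = select_features(names, fused, threshold=0.3)
print(selected)
```

Normalizing before averaging keeps any one criterion from dominating just because its raw scores live on a larger scale; a weighted mean or a rank-based fusion would be equally plausible readings of the abstract.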

Key words: code smell, feature selection, ensemble learning, weight fusion, Stacking model
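For reference, the two evaluation metrics named in the abstract can be computed from a binary confusion matrix. The sketch below assumes the usual definitions, which the abstract itself does not spell out: F-measure as the harmonic mean of precision and recall, and G-mean as the geometric mean of sensitivity and specificity. The function names and the sample labels are hypothetical.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary classification result."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def f_measure(tp, fp, fn):
    """Harmonic mean of precision and recall (0.0 when undefined)."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sqrt(sensitivity * specificity)


# Hypothetical predictions: 1 = smelly, 0 = clean.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(f_measure(tp, fp, fn), g_mean(tp, fp, tn, fn))
```

G-mean falls to zero whenever one class is misclassified entirely, which is why it is commonly reported next to F-measure when smelly and clean samples are unevenly represented.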

CLC number: TP391

AI Chenghao, GAO Jianhua, HUANG Zijie. Code Smell Detection Driven by Hybrid Feature Selection and Ensemble Learning[J]. Computer Engineering, 2022, 48(7): 168-176,198.

https://www.ecice06.com/CN/Y2022/V48/I7/168

Figures/Tables: 13

