A malicious code classification method based on the fusion of LightGBM and multidimensional features

doi:10.19678/j.issn.1000-3428.0252241

Abstract

In computer security, protection against malicious code has long been an important research topic. As computer technology develops rapidly, the types and forms of malicious code keep evolving. Traditional feature engineering methods rely on a single feature dimension when handling complex malicious samples, which limits their representational power and prevents accurate identification of the various types of malicious code. Other malicious code classification methods based on feature fusion depend on expert experience to design features by hand during feature extraction, while multimodal deep learning models offer limited interpretability and incur high computational cost. To address these issues, this paper proposes an innovative feature fusion method for classifying malicious code in Windows PE files. The method integrates behavioral, structural, and texture features and uses LightGBM as the classifier. Experimental results show that the proposed method achieves a test accuracy of 99.90% and a log loss (Logloss) of 0.0057 on the Microsoft Malware Classification Challenge dataset, and a test accuracy of 98.97% and a log loss of 0.042 on the Bazaar dataset. These results indicate that the method represents malicious code comprehensively and accurately, and that it has both theoretical significance and practical value. By fusing multidimensional features, the method provides an effective solution for malicious code detection and has broad application prospects.
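The abstract reports two evaluation metrics, test accuracy and multiclass log loss, computed on a classifier trained over fused features. The following is a minimal sketch of how these quantities are commonly computed. It assumes the three feature groups are fused by simple concatenation, which the abstract does not confirm. The function names, feature values, and predicted probabilities are hypothetical and stand in for the output of a trained classifier such as LightGBM.

```python
import math


def fuse_features(behavioral, structural, texture):
    """Assumed fusion strategy: concatenate the three feature groups of one sample."""
    return list(behavioral) + list(structural) + list(texture)


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted class equals the true class."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss: mean negative log-probability assigned to the true class."""
    total = 0.0
    for label, row in zip(y_true, probs):
        # Clip to avoid log(0) when a probability is exactly 0 or 1.
        p = min(max(row[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical toy data: one fused sample, plus class probabilities for three
# samples as a trained classifier might output them.
fused = fuse_features([0.2, 0.7], [5.0, 3.0, 1.0], [0.11])
y_true = [0, 1, 2]
probs = [[0.9, 0.05, 0.05], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6]]
# Predicted class is the index of the largest probability in each row.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

print(len(fused), accuracy(y_true, y_pred), log_loss(y_true, probs))
```

A lower log loss means the classifier assigns higher probability to the correct family, so it captures confidence as well as correctness, which is why it is reported alongside accuracy.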


Shuai Feng, Jian Gao. A malicious code classification method based on the fusion of LightGBM and multidimensional features[J]. Computer Engineering, doi: 10.19678/j.issn.1000-3428.0252241.



URL: https://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0252241

