Computer Engineering ›› 2025, Vol. 51 ›› Issue (12): 232-243. doi: 10.19678/j.issn.1000-3428.0252241

• Cyberspace Security •

Malicious Code Classification Method Based on Multidimensional Feature Fusion

FENG Shuai, GAO Jian*

  1. College of Information Network Security, People's Public Security University of China, Beijing 100038, China
  • Received: 2025-03-18 Revised: 2025-04-29 Published: 2025-12-15 Online: 2025-07-03
  • Corresponding author: GAO Jian
  • Funding:
    Fundamental Research Funds for the Central Universities, People's Public Security University of China (2024JKF17)

Abstract:

Malicious code protection is a long-standing research topic in computer security. As computer technology develops rapidly, the types and forms of malicious code continue to evolve. Traditional feature engineering methods rely on a single feature dimension when handling complex malicious samples, which limits their representational capacity and prevents them from accurately identifying the various types of malicious code. Other classification methods based on feature fusion depend on expert experience for manual feature design during feature extraction, whereas multimodal deep learning models lack interpretability and incur high computational overhead. To address these problems, a new feature fusion method is proposed for classifying malicious code in Windows PE files. The method integrates behavioral, structural, and textural features and uses LightGBM as the classifier. Experimental results show that the method achieves a test accuracy of 99.90% with a logarithmic loss of 0.0057 on the Microsoft Malware Classification Challenge dataset, and a test accuracy of 98.97% with a logarithmic loss of 0.0420 on the Bazaar dataset. By fusing multidimensional features, the proposed method characterizes malicious code comprehensively and accurately and provides an effective solution for malicious code detection, with both theoretical significance and practical application value.

Key words: malicious code, feature fusion, feature engineering, multimodal, LightGBM
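The abstract reports two metrics, test accuracy and logarithmic loss, for a classifier trained on fused features. The sketch below shows how these quantities are commonly computed. It is illustrative only and is not taken from the paper. The abstract does not say how the three feature groups are combined, so plain concatenation (early fusion) is an assumption, and the function names `fuse_features`, `accuracy`, and `log_loss` are hypothetical. LightGBM itself is not invoked; the class probabilities are stand-ins for any classifier's output.

```python
import math
from typing import List, Sequence


def fuse_features(*views: Sequence[float]) -> List[float]:
    """Early fusion (assumed): concatenate per-view feature vectors."""
    fused: List[float] = []
    for view in views:
        fused.extend(view)
    return fused


def accuracy(y_true: Sequence[int], probs: Sequence[Sequence[float]]) -> float:
    """Fraction of samples whose highest-probability class equals the label."""
    hits = 0
    for label, p in zip(y_true, probs):
        predicted = max(range(len(p)), key=p.__getitem__)
        if predicted == label:
            hits += 1
    return hits / len(y_true)


def log_loss(y_true: Sequence[int], probs: Sequence[Sequence[float]],
             eps: float = 1e-15) -> float:
    """Multi-class logarithmic loss: mean of -log(probability of true class)."""
    total = 0.0
    for label, p in zip(y_true, probs):
        # Clip to avoid log(0) when a classifier outputs an exact 0 or 1.
        clipped = min(max(p[label], eps), 1.0 - eps)
        total -= math.log(clipped)
    return total / len(y_true)


if __name__ == "__main__":
    # Hypothetical toy vectors standing in for the three feature groups.
    behavioral, structural, textural = [0.2, 0.7], [3.0], [0.1, 0.9, 0.4]
    print(fuse_features(behavioral, structural, textural))

    # Hypothetical predicted class probabilities for three samples.
    labels = [0, 1, 1]
    probabilities = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4]]
    print(accuracy(labels, probabilities), log_loss(labels, probabilities))
```

Logarithmic loss penalizes confident wrong predictions much more heavily than accuracy does, which is why the two are often reported together for multi-class problems.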