
Computer Engineering ›› 2020, Vol. 46 ›› Issue (12): 299-304, 312. doi: 10.19678/j.issn.1000-3428.0056577

• Development Research and Engineering Application •

Research on Spam Filtering Technology Based on IMI-WNB Algorithm

LIU Jie, WANG Zheng, WANG Hui

  1. School of Computer Science and Technology, Henan Polytechnic University, Jiaozuo, Henan 454000, China
  • Received: 2019-11-13 Revised: 2019-12-25 Published: 2020-01-14
  • About the authors: LIU Jie (born 1979), female, associate professor, M.S.; main research interests: network security, database technology, and software technology. WANG Zheng (corresponding author), master's student. WANG Hui, associate professor, Ph.D.
  • Funding:
    National Natural Science Foundation of China (61300216).

Abstract: When Mutual Information (MI) and the Naive Bayes (NB) algorithm are applied to spam filtering, they suffer from feature redundancy and from an independence assumption that does not hold. To address these problems, this paper proposes an Improved Mutual Information-Weighted Naive Bayes (IMI-WNB) algorithm. To address the low efficiency of MI, a word frequency factor and an inter-class difference factor are introduced, yielding an Improved Mutual Information (IMI) feature selection algorithm that reduces feature dimensionality more efficiently. To address the independence assumption of the NB classifier, IMI values are used as feature weights during NB classification, which removes part of the adverse effect of the conditional independence assumption on mail classification. Experimental results show that, compared with the traditional NB algorithm, the proposed algorithm improves the precision, recall, and stability of spam filtering.

Key words: Mutual Information (MI), spam filtering, Weighted Naive Bayes (WNB) algorithm, feature selection, word frequency
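
The two-stage idea in the abstract (score features, keep the best ones, then weight each feature's contribution inside Naive Bayes) can be illustrated with a minimal Python sketch. The abstract does not give the IMI formula, so plain mutual information between term presence and class stands in for it here; the word frequency and inter-class difference factors are not modeled. All names (`mutual_information`, `select_top_k`, `WeightedNB`), the Laplace smoothing constant, and the toy data are assumptions made for illustration, not the authors' implementation.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Plain MI between binary term presence and class (stand-in for IMI)."""
    n = len(docs)
    class_count = Counter(labels)
    term_count = Counter()   # term -> number of documents containing it
    joint = Counter()        # (term, label) -> documents of that label containing it
    for tokens, y in zip(docs, labels):
        for t in set(tokens):
            term_count[t] += 1
            joint[(t, y)] += 1
    scores = {}
    for t, n_t in term_count.items():
        mi = 0.0
        for c, n_c in class_count.items():
            n_tc = joint[(t, c)]
            # Two cells per class: term present, term absent.
            for n_cell, n_row in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if n_cell > 0:
                    mi += (n_cell / n) * math.log(n_cell * n / (n_row * n_c))
        scores[t] = mi
    return scores


def select_top_k(scores, k):
    """Keep the k highest-scoring terms (feature dimensionality reduction)."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    return dict(ranked[:k])


class WeightedNB:
    """Multinomial NB whose per-term log-likelihoods are scaled by a weight."""

    def fit(self, docs, labels, weights, alpha=1.0):
        self.weights = weights            # term -> non-negative weight
        self.vocab = set(weights)
        self.classes = sorted(set(labels))
        self.log_prior, self.log_lik = {}, {}
        for c in self.classes:
            counts, n_c = Counter(), 0
            for tokens, y in zip(docs, labels):
                if y == c:
                    n_c += 1
                    counts.update(t for t in tokens if t in self.vocab)
            total = sum(counts.values()) + alpha * len(self.vocab)
            self.log_prior[c] = math.log(n_c / len(docs))
            self.log_lik[c] = {
                t: math.log((counts[t] + alpha) / total) for t in self.vocab
            }
        return self

    def predict(self, tokens):
        def score(c):
            return self.log_prior[c] + sum(
                self.weights[t] * self.log_lik[c][t]
                for t in tokens if t in self.vocab
            )
        return max(self.classes, key=score)


# Toy data, invented purely for illustration.
DOCS = [
    ["win", "money", "now"],
    ["cheap", "money", "offer"],
    ["meeting", "at", "noon"],
    ["project", "meeting", "notes"],
]
LABELS = ["spam", "spam", "ham", "ham"]

weights = select_top_k(mutual_information(DOCS, LABELS), k=2)
model = WeightedNB().fit(DOCS, LABELS, weights)
print(model.predict(["cheap", "money"]))
```

In this sketch a weight above 1 sharpens a term's influence on the class score and a weight near 0 mutes it, which is one common way to soften the conditional independence assumption; the paper's actual weighting scheme may differ.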

CLC number: