
Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology •

Chinese Spam Message Filtering Based on Text Weighted KNN Algorithm

HUANG Wenming a, MO Yang b

  1. (a. Guangxi Key Lab of Trusted Software; b. School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, Guangxi 541004, China)
  • Received: 2016-01-27 Online: 2017-03-15 Published: 2017-03-15
  • About the authors: HUANG Wenming (born 1963), male, professor; research interests: artificial intelligence and big data processing. MO Yang, master's student.
  • Funding:
    Research project of the Guangxi Key Lab of Trusted Software (kx201106); Graduate Education Innovation Program of Guilin University of Electronic Technology (2016YJCX64).

Abstract: The decision rule of the K-Nearest Neighbor (KNN) algorithm treats every training sample as equally important, which degrades its performance in text classification. To address this, the paper proposes a text-weighted KNN text classification algorithm and applies it to spam message classification. After feature words are extracted, a first weighting formula is introduced to account for how the frequency of feature words in a text affects that text's importance. An association rule algorithm is then used to mine word groups that frequently co-occur in spam messages, and a second weighting formula is built on them. Finally, the two formulas are combined into a composite weight for each message, so that training samples differ in how strongly they influence the class decision, which improves the classification decision rule. Experimental results show that, compared with KNN without text weighting, the algorithm achieves considerably better accuracy, recall, and F1 on both spam and normal messages.

Keywords: spam filtering, association rules, feature selection, K-Nearest Neighbor (KNN) algorithm, Vector Space Model (VSM)
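The abstract does not state the two weighting formulas, so the following Python sketch only illustrates the general idea of a sample-weighted KNN vote. The functions `frequency_weight` and `cooccurrence_weight`, the choice to multiply them into a composite weight, the cosine similarity over raw term counts, and the toy data are all assumptions made for illustration, not the authors' formulas. Messages are assumed to be already segmented into tokens, and `frequent_sets` stands in for the co-occurring word groups that an association rule miner would return.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def frequency_weight(tokens, feature_words):
    """Hypothetical first weight: 1 + share of tokens that are feature words."""
    if not tokens:
        return 1.0
    hits = sum(1 for t in tokens if t in feature_words)
    return 1.0 + hits / len(tokens)


def cooccurrence_weight(tokens, frequent_sets):
    """Hypothetical second weight: 1 + number of frequent word groups
    that are fully contained in the message."""
    present = set(tokens)
    return 1.0 + sum(1 for group in frequent_sets if group <= present)


def classify(query_tokens, training, feature_words, frequent_sets, k=3):
    """Return the label with the largest weighted vote among the k nearest
    training messages. Each vote is (composite sample weight) * similarity."""
    query = Counter(query_tokens)
    scored = []
    for tokens, label in training:
        similarity = cosine(query, Counter(tokens))
        # Assumed composite weight: product of the two illustrative weights.
        weight = (frequency_weight(tokens, feature_words)
                  * cooccurrence_weight(tokens, frequent_sets))
        scored.append((similarity, weight, label))
    scored.sort(key=lambda item: item[0], reverse=True)

    votes = Counter()
    for similarity, weight, label in scored[:k]:
        votes[label] += weight * similarity
    return votes.most_common(1)[0][0]


# Toy data, invented for this example only.
training = [
    (["free", "prize", "click", "link"], "spam"),
    (["win", "free", "prize", "now"], "spam"),
    (["meeting", "at", "noon", "today"], "ham"),
    (["lunch", "today", "at", "noon"], "ham"),
    (["free", "lunch", "today"], "ham"),
]
feature_words = {"free", "prize", "click", "win"}
frequent_sets = [frozenset({"free", "prize"})]

print(classify(["free", "prize", "today"], training, feature_words, frequent_sets))
```

With equal weights this reduces to ordinary similarity-weighted KNN voting; the per-sample weight is what lets a training message rich in feature words or frequent spam word groups count for more than a neighbor that is merely close.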
