
Computer Engineering ›› 2010, Vol. 36 ›› Issue (21): 245-247. doi: 10.3969/j.issn.1000-3428.2010.21.088

• Development Research and Design Technology •

Spam Filtering Method Based on Learning from Small Samples

PAN Jie-zhu1, ZHOU Xiao1, WU Gong-qing2, HU Xue-gang2

  1. Department of Computer Science and Technology, Hefei Normal University, Hefei 230061, China
  2. School of Computer and Information, Hefei University of Technology, Hefei 230009, China
  • Online: 2010-11-05 Published: 2010-11-03
  • About the authors: PAN Jie-zhu (born 1979), female, lecturer, M.S., main research interest: data mining; ZHOU Xiao and WU Gong-qing, lecturers, M.S.; HU Xue-gang, professor, Ph.D.
  • Supported by:
    National "973" Program of China (2009CB326203); National Natural Science Foundation of China (60975034); Provincial Natural Science Research Fund of Anhui Higher Education Institutions (KJ2009B238Z)

Abstract: Client-side spam filters often cannot obtain enough labeled training samples. To address this problem, this paper proposes a spam filtering method based on learning from small samples, which uses easily obtained unlabeled samples to improve filtering performance. An initial Naïve Bayes (NB) classifier is trained on a small set of labeled e-mails and is used to probabilistically label the unlabeled e-mails. A new classifier is then trained on all the data, and the two steps are iterated with the EM algorithm until convergence. Experimental results show that, given 5 to 20 labeled training e-mails, the method effectively improves spam filtering performance.

Key words: learning from small samples, EM algorithm, unlabeled data, spam filtering
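
The loop described in the abstract (train an initial Naïve Bayes classifier on the few labeled e-mails, label the unlabeled e-mails probabilistically, retrain on all data, repeat until convergence) can be sketched as follows. This is only an illustrative sketch, not the authors' implementation: the multinomial word model, Laplace smoothing (`alpha`), whitespace tokenization, the convergence test on the change in soft labels, the toy e-mails, and all function names are assumptions that the abstract does not specify.

```python
import math
from collections import Counter

CLASSES = ("ham", "spam")  # assumed two-class setup


def train_nb(docs, weights, vocab, alpha=1.0):
    """M-step: fit a multinomial Naive Bayes model from soft class weights.

    docs is a list of token lists; weights is a parallel list of
    {class: P(class | doc)} dicts. alpha is an assumed Laplace smoothing term.
    """
    prior = {c: alpha for c in CLASSES}
    counts = {c: Counter() for c in CLASSES}
    for tokens, w in zip(docs, weights):
        for c in CLASSES:
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]  # fractional (soft) word counts
    total = sum(prior.values())
    log_prior = {c: math.log(prior[c] / total) for c in CLASSES}
    log_like = {}
    for c in CLASSES:
        denom = sum(counts[c].values()) + alpha * len(vocab)
        log_like[c] = {t: math.log((counts[c][t] + alpha) / denom) for t in vocab}
    return log_prior, log_like


def posterior(tokens, model):
    """E-step for one e-mail: P(class | tokens) under the current model."""
    log_prior, log_like = model
    scores = {
        c: log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in CLASSES
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}


def em_naive_bayes(labeled, unlabeled, max_iter=50, tol=1e-6):
    """labeled: list of (tokens, class) pairs; unlabeled: list of token lists."""
    lab_docs = [d for d, _ in labeled]
    lab_w = [{c: 1.0 if c == y else 0.0 for c in CLASSES} for _, y in labeled]
    vocab = {t for d in lab_docs + unlabeled for t in d}
    model = train_nb(lab_docs, lab_w, vocab)  # initial classifier: labeled data only
    prev = None
    for _ in range(max_iter):
        unl_w = [posterior(d, model) for d in unlabeled]              # E-step
        model = train_nb(lab_docs + unlabeled, lab_w + unl_w, vocab)  # M-step
        flat = [w["spam"] for w in unl_w]
        # Assumed stopping rule: soft labels barely changed since the last pass.
        if prev is not None and max(
            (abs(a - b) for a, b in zip(flat, prev)), default=0.0
        ) < tol:
            break
        prev = flat
    return model


# Toy data (hypothetical): two labeled e-mails, four unlabeled ones.
labeled = [
    ("win free prize now".split(), "spam"),
    ("meeting agenda for monday".split(), "ham"),
]
unlabeled = [
    "free prize claim now".split(),
    "claim your free cash".split(),
    "monday project meeting notes".split(),
    "notes for the project agenda".split(),
]
model = em_naive_bayes(labeled, unlabeled)
print(posterior("claim cash".split(), model))
```

In this toy run the words "claim" and "cash" never occur in a labeled e-mail, so the initial classifier has no evidence about them. They pick up spam weight only through the unlabeled e-mails that share words with the labeled spam, which is the effect the abstract attributes to using unlabeled data.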
