Abstract: Client-side spam filters rarely have enough labeled e-mails to train on. To address this problem, this paper proposes a spam filtering method based on learning from small samples, which uses easily obtained unlabeled samples to improve filtering performance. An initial Naïve Bayes (NB) classifier is trained on a small set of labeled e-mails and used to probabilistically label the unlabeled e-mails. A new classifier is then trained on all the data, and the EM algorithm repeats these two steps until convergence. Experimental results show that, given only 5 to 20 labeled training e-mails, the method effectively improves spam filtering performance.
Key words:
learning from small samples,
EM algorithm,
unlabeled data,
spam filtering
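The procedure described in the abstract (train an initial NB classifier on the labeled e-mails, label the unlabeled e-mails probabilistically, retrain on all data, and iterate with EM until convergence) can be illustrated with a minimal sketch. The abstract does not specify the feature representation, the smoothing, or the convergence test, so everything below is an assumption: a multinomial NB over word tokens with Laplace smoothing, and a stopping rule based on the change in the soft labels. The names `train_nb`, `posterior`, and `em_spam_filter` are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def train_nb(docs, weights, vocab):
    """M-step: fit a multinomial NB from documents with soft class weights.

    docs    -- list of token lists
    weights -- list of [p_ham, p_spam], one pair per document
    vocab   -- set of all known tokens (assumed fixed up front)
    """
    prior = [1.0, 1.0]               # Laplace-smoothed class mass
    counts = [Counter(), Counter()]  # fractional word counts per class
    for tokens, w in zip(docs, weights):
        for c in (0, 1):
            prior[c] += w[c]
            for t in tokens:
                counts[c][t] += w[c]
    total = sum(prior)
    log_prior = [math.log(p / total) for p in prior]
    log_like = []
    for c in (0, 1):
        denom = sum(counts[c].values()) + len(vocab)
        log_like.append(
            {t: math.log((counts[c][t] + 1.0) / denom) for t in vocab}
        )
    return log_prior, log_like


def posterior(model, tokens):
    """E-step for one document: return [P(ham | doc), P(spam | doc)]."""
    log_prior, log_like = model
    scores = [
        log_prior[c] + sum(log_like[c][t] for t in tokens if t in log_like[c])
        for c in (0, 1)
    ]
    m = max(scores)  # subtract the max before exponentiating, for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def em_spam_filter(labeled, unlabeled, max_iter=50, tol=1e-6):
    """labeled: list of (tokens, label), label 0 = ham, 1 = spam.
    unlabeled: list of token lists."""
    lab_docs = [d for d, _ in labeled]
    lab_w = [[1.0 - y, float(y)] for _, y in labeled]  # hard labels stay fixed
    vocab = {t for d in lab_docs + unlabeled for t in d}

    # Initial classifier from the small labeled set only.
    model = train_nb(lab_docs, lab_w, vocab)
    prev = None
    for _ in range(max_iter):
        # E-step: probabilistically label the unlabeled e-mails.
        soft = [posterior(model, d) for d in unlabeled]
        # M-step: retrain on labeled and soft-labeled data together.
        model = train_nb(lab_docs + unlabeled, lab_w + soft, vocab)
        # Assumed stopping rule: soft labels have stopped changing.
        if prev is not None:
            delta = max(
                (abs(a[1] - b[1]) for a, b in zip(soft, prev)), default=0.0
            )
            if delta < tol:
                break
        prev = soft
    return model
```

A call such as `em_spam_filter(labeled, unlabeled)` returns the final model, and `posterior(model, tokens)[1]` gives the estimated spam probability of a new e-mail. Keeping the labeled e-mails at their hard labels in every M-step is one common design choice for this kind of semi-supervised EM; whether the paper does the same is not stated in the abstract.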
PAN Jie-Zhu (潘洁珠), ZHOU Xiao (周晓), WU Gong-Qing (吴共庆), HU Xue-Gang (胡学钢). Spam Filtering Method Based on Learning from Small Samples[J]. Computer Engineering, 2010, 36(21): 245-247.