Spam Filter Based on Multiple Classifier Combinational Model

Computer Engineering, 2010, Vol. 36, Issue (18): 194-196. doi: 10.3969/j.issn.1000-3428.2010.18.067

LIU Ju-xin, XU Cong-fu

(College of Computer Science and Technology, Zhejiang University, Hangzhou 310027, China)

Online: 2010-09-20 Published: 2010-09-30
About the authors: LIU Ju-xin (b. 1982), male, master's student, whose main research interests are artificial intelligence, machine learning, and text classification; XU Cong-fu, associate professor
Funding: National High Technology Research and Development Program of China ("863" Program) (2007AA01Z197)

Abstract

This paper addresses the unequal-cost problem in spam filtering: misclassifying a legitimate email (ham) as spam costs far more than misclassifying spam as ham. It proposes a combined classifier framework with a two-layer structure. Sample emails are preprocessed so that text features and behavioral features are used together. Building on improvements to each individual classifier, the framework optimizes the combination of different classifiers and adjusts the model promptly through feedback, which gives the filter an efficient self-learning capability.

Key words: spam filter, combinational classifier, two-layer structure, bit entropy, false positive rate
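The ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the second layer here is assumed to be an online logistic regression over the first-layer scores, and every name in it (`TwoLayerFilter`, `text_score`, `behavior_score`, the keyword set, the costs 9 and 1, the learning rate) is hypothetical and chosen only for illustration. The sketch shows three things: a first layer of base scorers (one text-based, one behavior-based), a second layer that combines their scores, and a decision threshold derived from the unequal costs so that a ham message is blocked only when the filter is very confident.

```python
import math


def sigmoid(z):
    # Numerically safe logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def spam_threshold(cost_fp, cost_fn):
    # Blocking costs (1 - p) * cost_fp in expectation, delivering costs
    # p * cost_fn, so blocking is cheaper only when p exceeds this value.
    return cost_fp / (cost_fp + cost_fn)


def text_score(email):
    # Hypothetical text feature: share of a tiny keyword list that appears.
    words = email["body"].lower().split()
    hits = sum(w in {"free", "winner", "prize"} for w in words)
    return min(1.0, hits / 3.0)


def behavior_score(email):
    # Hypothetical behavioral feature: mass mailing looks more spam-like.
    return min(1.0, email["recipients"] / 50.0)


class TwoLayerFilter:
    """Layer 1: base scorers in [0, 1]. Layer 2: online logistic regression."""

    def __init__(self, base_scorers, cost_fp=9.0, cost_fn=1.0, lr=0.5):
        self.base_scorers = base_scorers
        self.weights = [0.0] * len(base_scorers)
        self.bias = 0.0
        self.lr = lr
        self.threshold = spam_threshold(cost_fp, cost_fn)

    def _features(self, email):
        # Centre each base score so that 0.5 means "no opinion".
        return [scorer(email) - 0.5 for scorer in self.base_scorers]

    def probability(self, email):
        x = self._features(email)
        z = self.bias + sum(w * xi for w, xi in zip(self.weights, x))
        return sigmoid(z)

    def is_spam(self, email):
        return self.probability(email) >= self.threshold

    def feedback(self, email, label):
        # One gradient step on the log loss; label 1 = spam, 0 = ham.
        x = self._features(email)
        err = label - self.probability(email)
        self.weights = [w + self.lr * err * xi
                        for w, xi in zip(self.weights, x)]
        self.bias += self.lr * err


if __name__ == "__main__":
    spam = {"body": "free prize winner", "recipients": 100}
    ham = {"body": "meeting notes attached", "recipients": 1}
    flt = TwoLayerFilter([text_score, behavior_score])
    for _ in range(200):  # simulated user feedback
        flt.feedback(spam, 1)
        flt.feedback(ham, 0)
    print(flt.probability(spam), flt.probability(ham), flt.threshold)
```

With a ham-blocking cost nine times the spam-delivery cost, the threshold on the combined spam probability is 0.9 rather than 0.5, which is one simple way to encode the cost asymmetry. Before any feedback arrives the combined probability is 0.5, so nothing is blocked by default.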

CLC number: TP181

LIU Ju-xin, XU Cong-fu. Spam Filter Based on Multiple Classifier Combinational Model[J]. Computer Engineering, 2010, 36(18): 194-196.

http://www.ecice06.com/CN/Y2010/V36/I18/194
