Computer Engineering ›› 2006, Vol. 32 ›› Issue (17): 60-62,6. doi: 10.3969/j.issn.1000-3428.2006.17.021

• Special Topic Papers •

Application of Weight Calculation Based on Term Frequency for E-mail Filtering

LIU Hui1,2; MA Jun1; LEI Jingsheng1,3; SONG Ling4

  (1. School of Computer Science and Technology, Shandong University, Jinan 250061; 2. School of Computer Science and Technology, Shandong Economic University, Jinan 250014; 3. School of Information Science and Technology, Hainan University, Haikou 570228; 4. School of Computer Science and Technology, Shandong Construction University, Jinan 250014)
  • Online:2006-09-05 Published:2006-09-05

Abstract: Filtering based on text classification is currently the main approach to the spam problem, but it still lacks standardized patterns and methods, as well as filtering mechanisms with high retrieval performance. This paper addresses these problems by using the feature fields of an e-mail. It introduces the concepts of feature term and feature field, and builds a feature lexicon by analyzing the training corpus with an inter-class correlation evaluation function. It then analyzes the important role that feature fields play in expressing the topic of an e-mail, presents a weight calculation method based on term frequency (TF) within feature fields, and improves the traditional probabilistic model for text similarity calculation. Experiments show that the proposed method clearly improves on the traditional Rocchio method on several performance measures for e-mail filtering, including recall and precision.

Key words: E-mail filtering, Feature term, Feature field, Term frequency, Weight calculation
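
The abstract does not state the weighting formula or the improved similarity model, so the following is only a minimal sketch of the general idea of weighting term frequency by the field in which a term occurs. The field names, the values in `FIELD_WEIGHTS`, the toy lexicon, and the use of cosine similarity (substituted here for the paper's probabilistic model) are all assumptions made for illustration, not the authors' method.

```python
import math
from collections import Counter

# Hypothetical per-field weights (assumption): terms in the subject line
# are taken to express the topic more strongly than terms in the body.
FIELD_WEIGHTS = {"subject": 3.0, "body": 1.0}


def field_weighted_tf(fields, lexicon):
    """Weight of a term = sum over fields of (field weight * TF in that field).

    `fields` maps a field name to its list of tokens; only terms found in
    the feature lexicon are kept.
    """
    weights = Counter()
    for name, tokens in fields.items():
        field_weight = FIELD_WEIGHTS.get(name, 1.0)
        for token in tokens:
            if token in lexicon:
                weights[token] += field_weight
    return dict(weights)


def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy data (assumption): a three-term lexicon and a hand-made spam profile.
lexicon = {"free", "offer", "meeting"}
mail = {"subject": ["free", "offer"], "body": ["free", "gift", "today"]}
spam_profile = {"free": 1.0, "offer": 1.0}

vec = field_weighted_tf(mail, lexicon)
score = cosine(vec, spam_profile)
```

In this sketch "free" scores 4.0 (3.0 from the subject plus 1.0 from the body) and "offer" scores 3.0, while "gift" and "today" are dropped because they are not in the lexicon. A filter built this way would compare `score` against a threshold chosen on training data.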
