Computer Engineering ›› 2007, Vol. 33 ›› Issue (09): 100-102.

• Software Technology and Database •

Method of Spam Filtering Based on General Suffix Tree Model

TAN Jianlong, ZHANG Ji, GUO Li   

  1. (Software Division, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100085)
  • Online: 2007-05-05 Published: 2007-05-05

Abstract: This paper proposes a content-based spam filtering method built on the general suffix tree model (GSTM). The method exploits the contextual information in mail content by collecting variable-length n-gram statistics at every text position, measures how similar a test message is to each training set, and assigns the message to the best-matching class. Theoretical analysis and experiments show that, on the same corpus, its precision and recall match or exceed those of mail filtering methods based on the vector space model (VSM). Filtering a message of length N takes O(N) time, and adding a new message of length N to the training set also takes O(N) time, so the training set can grow dynamically. Because the method requires no word segmentation, it is fully language independent and suits settings where mail in several languages coexists.

Key words: text classification, spam, general suffix tree
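
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a depth-bounded suffix trie in place of a true general suffix tree, and the similarity measure (mean longest-match length per text position) is a hypothetical stand-in for the paper's statistics. The names `SuffixTrie`, `max_depth`, `score`, and `classify` are all invented for illustration.

```python
class SuffixTrie:
    """Depth-bounded suffix trie (an assumed stand-in for the GSTM)."""

    def __init__(self, max_depth=8):
        self.max_depth = max_depth  # hypothetical cap on substring length
        self.root = {}              # nested dicts: char -> child node

    def add(self, text):
        """Insert every suffix of `text`, truncated to max_depth characters.

        New mail can be added at any time, so the training set can grow.
        """
        for i in range(len(text)):
            node = self.root
            for ch in text[i:i + self.max_depth]:
                node = node.setdefault(ch, {})

    def match_length(self, text, start):
        """Length of the longest substring starting at `start` found in the trie."""
        node, length = self.root, 0
        for ch in text[start:start + self.max_depth]:
            node = node.get(ch)
            if node is None:
                break
            length += 1
        return length

    def score(self, text):
        """Mean longest-match length over all positions (hypothetical similarity)."""
        if not text:
            return 0.0
        total = sum(self.match_length(text, i) for i in range(len(text)))
        return total / len(text)


def classify(text, models):
    """Return the label whose trie is most similar to `text`."""
    return max(models, key=lambda label: models[label].score(text))


# Toy training data; characters are used directly, so no word segmentation is needed.
models = {"spam": SuffixTrie(), "ham": SuffixTrie()}
models["spam"].add("win a free prize now, click here")
models["spam"].add("免费领取大奖,点击这里")
models["ham"].add("the meeting is moved to friday afternoon")
models["ham"].add("请查收本周的项目进度报告")

print(classify("click here to win a prize", models))
print(classify("friday meeting notes", models))
print(classify("点击这里领取", models))
```

With `max_depth` held constant, both `add` and `score` do a bounded amount of work per character, so each is linear in the message length N. A true suffix tree built with an online linear-time construction such as Ukkonen's algorithm would remove the depth bound while keeping linear time.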