
Computer Engineering ›› 2009, Vol. 35 ›› Issue (24): 69-71. doi: 10.3969/j.issn.1000-3428.2009.24.023

• Software Technology and Database •

Reactionary Text Filtering Method Based on K-Nearest Neighbor

WANG Hong-bin, LIU Xiao-jie

  1. (School of Computer, Sichuan University, Chengdu 610065)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-12-20 Published: 2009-12-20

Abstract: Filtering of reactionary (objectionable) text is an active research topic. Through a detailed analysis of the χ² statistic, this paper shows that the χ² statistic has a distinctive advantage in feature extraction for two-class text. It proposes a threshold δ for positive texts and derives its value theoretically. On this basis, the K-Nearest Neighbor (KNN) algorithm is improved: the uncertainty of N in the KNN algorithm is eliminated, the algorithm becomes fully parameter-free, and the time spent on classification is greatly reduced. Experimental results show that the algorithm meets the requirements of real-time online classification of Web text.

Key words: K-Nearest Neighbor (KNN) algorithm, reactionary text filtering, χ² statistic

CLC Number:
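
For readers unfamiliar with the χ² statistic named in the abstract, the sketch below computes the standard textbook form, χ²(t, c) = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)), from document counts for a term and a class, and ranks terms in a two-class corpus. It is an illustration only, not the paper's own derivation or code: the function names `chi_square` and `rank_terms`, the toy documents, and the labels are all assumptions made for this example, and the threshold δ and the improved KNN step are not reproduced because this page does not describe them.

```python
from collections import Counter


def chi_square(a, b, c, d):
    """Standard chi-square score of a term for a class (2x2 contingency table).

    a: documents in the class that contain the term
    b: documents outside the class that contain the term
    c: documents in the class that lack the term
    d: documents outside the class that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # A row or column of the table is empty, so the term carries no signal.
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def rank_terms(docs, labels, target):
    """Rank terms by chi-square score with respect to the `target` label.

    docs: list of token lists (hypothetical, already tokenized)
    labels: list of class labels, one per document
    """
    n_target = sum(1 for y in labels if y == target)
    n_other = len(labels) - n_target
    in_target, in_other = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        # Count document frequency, so each term counts once per document.
        (in_target if y == target else in_other).update(set(tokens))
    scores = {}
    for term in set(in_target) | set(in_other):
        a, b = in_target[term], in_other[term]
        scores[term] = chi_square(a, b, n_target - a, n_other - b)
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Toy two-class corpus (made up for illustration).
docs = [["bad", "word"], ["bad", "text"], ["good", "text"], ["good", "word"]]
labels = [1, 1, 0, 0]
ranked = rank_terms(docs, labels, target=1)
```

One property of this formula that is easy to check: with only two classes, swapping the roles of the class and its complement leaves the score unchanged, so a single ranking serves both classes. Whether this is the advantage the paper proves is not stated on this page.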