Computer Engineering ›› 2010, Vol. 36 ›› Issue (24): 175-177. doi: 10.3969/j.issn.1000-3428.2010.24.063

• Artificial Intelligence and Recognition Technology •

Fast KNN Algorithm for Text Classification Based on Rough Set

SUN Rong-zong a,b, MIAO Duo-qian a,b, WEI Zhi-hua a,b, LI Wen a,b

  1. (a. Department of Computer Science and Technology, School of Electronics and Information Engineering; b. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804, China)
  • Publication date: 2010-12-20  Release date: 2010-12-14
  • About the authors: SUN Rong-zong (born 1982), male, M.S., main research interests: text classification and rough set theory; MIAO Duo-qian, professor and doctoral supervisor; WEI Zhi-hua and LI Wen, Ph.D.
  • Supported by:

    National Natural Science Foundation of China (60775036, 60475019); Specialized Research Fund for the Doctoral Program of Higher Education (20060247039)

Abstract:

A well-known drawback of the traditional K Nearest Neighbor (KNN) classifier is the high cost of computing sample similarities. In text classification, where training sets are large and high-dimensional, this cost makes KNN impractical. To address this, rough set theory is introduced into text classification. The distribution of the training samples of each class is described with the concepts of upper and lower approximation, and the extent of each class's upper and lower approximation is computed during training. During classification, depending on where the vector of the document to be classified lies in the sample space, the improved algorithm labels some documents directly and narrows the KNN search range. Experiments show that the algorithm significantly improves classification efficiency while keeping classification performance essentially the same as that of traditional KNN.

Key words: text classification, K Nearest Neighbor (KNN), rough set
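
The abstract does not say how the approximation regions are built, so the following is only a rough illustration of the general idea, not the authors' method. It assumes each class is summarized by a centroid and two radii: an upper radius that covers every training sample of the class, and a lower radius inside which no sample of another class occurs. The names `train`, `classify`, and `dist`, the use of Euclidean distance, and the toy one-dimensional data are all hypothetical choices made for this sketch.

```python
import math
from collections import Counter, defaultdict


def dist(a, b):
    """Euclidean distance (assumed here; cosine similarity is also common for text)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(samples, labels):
    """Return {class: (centroid, lower_radius, upper_radius)}.

    Assumed construction, not taken from the paper:
      upper radius = distance from the centroid to the farthest sample of the class;
      lower radius = distance from the centroid to the nearest sample of any
                     other class, capped at the upper radius.
    """
    by_class = defaultdict(list)
    for vec, lab in zip(samples, labels):
        by_class[lab].append(vec)

    model = {}
    for lab, vecs in by_class.items():
        centroid = [sum(col) / len(vecs) for col in zip(*vecs)]
        upper = max(dist(centroid, v) for v in vecs)
        foreign = [dist(centroid, v)
                   for v, other in zip(samples, labels) if other != lab]
        lower = min(min(foreign, default=math.inf), upper)
        model[lab] = (centroid, lower, upper)
    return model


def classify(query, samples, labels, model, k=3):
    """Return (predicted_label, number_of_training_samples_compared)."""
    d = {lab: dist(query, centroid) for lab, (centroid, _, _) in model.items()}

    # Inside exactly one lower approximation: label directly, no KNN search.
    in_lower = [lab for lab, (_, lower, _) in model.items() if d[lab] < lower]
    if len(in_lower) == 1:
        return in_lower[0], 0

    # Otherwise search only classes whose upper approximation covers the query;
    # fall back to the full training set if no class covers it.
    in_upper = {lab for lab, (_, _, upper) in model.items() if d[lab] <= upper}
    candidates = [(v, lab) for v, lab in zip(samples, labels)
                  if not in_upper or lab in in_upper]
    nearest = sorted(candidates, key=lambda p: dist(query, p[0]))[:k]
    votes = Counter(lab for _, lab in nearest)
    return votes.most_common(1)[0][0], len(candidates)


# Toy 1-D data: classes A and B sit close together, C is far away.
samples = [[0], [1], [2], [5], [4], [7], [8], [9], [20], [21], [22]]
labels = ["A"] * 4 + ["B"] * 4 + ["C"] * 3
model = train(samples, labels)

print(classify([1.5], samples, labels, model))  # ('A', 0): labelled directly
print(classify([4.4], samples, labels, model))  # ('A', 8): search limited to A and B
print(classify([50], samples, labels, model))   # ('C', 11): full search as fallback
```

The second element of each result counts how many training samples were compared with the query. In this toy setup it is 0 for a query deep inside one class and 8 of 11 for a query in the boundary region between A and B, which is the kind of saving the abstract describes. Under these assumptions the training step costs one pass over the data per class, and classification adds only one distance computation per class before any neighbor search.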

CLC Number: