Fast KNN Algorithm for Text Classification Based on Rough Set

Computer Engineering, 2010, Vol. 36, Issue (24): 175-177. doi: 10.3969/j.issn.1000-3428.2010.24.063

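SUN Rong-zong^a,b, MIAO Duo-qian^a,b, WEI Zhi-hua^a,b, LI Wen^a,b

(a. Department of Computer Science and Technology, School of Electronics and Information Engineering; b. Key Laboratory of Embedded System and Service Computing, Ministry of Education, Tongji University, Shanghai 201804, China)

Online: 2010-12-20 Published: 2010-12-14
About the authors: SUN Rong-zong (b. 1982), male, M.S.; main research interests: text classification, rough set theory. MIAO Duo-qian, professor and doctoral supervisor. WEI Zhi-hua and LI Wen, Ph.D.
Supported by:
National Natural Science Foundation of China (60775036, 60475019); Specialized Research Fund for the Doctoral Program of Higher Education (20060247039)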

Abstract

A clear drawback of the traditional K Nearest Neighbor (KNN) classifier is the heavy cost of computing sample similarities. For text classification with large numbers of high-dimensional samples, this complexity makes it impractical. To address this, rough set theory is introduced into text classification: the concepts of upper and lower approximation are used to describe the distribution of the training samples of each class, and the range of each class's upper and lower approximation is computed during training. During classification, depending on where the vector of the document to be classified lies in the sample space, the improved algorithm can assign some documents to a class directly and narrow the KNN search range for others. Experiments show that the algorithm significantly improves classification efficiency while keeping classification performance essentially the same as that of traditional KNN.

Key words: text classification, K Nearest Neighbor (KNN), rough set
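
The abstract does not say how the upper and lower approximation ranges are computed, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' method. It assumes each class is modeled as a ball around its centroid, with Euclidean distance standing in for the similarity measure: the upper radius covers every training sample of the class, and the lower radius stops short of the nearest sample of any other class. The names `train`, `classify`, and the toy data are hypothetical.

```python
import math
from collections import Counter


def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def train(samples):
    """samples: list of (vector, label). Returns {label: (center, lower_r, upper_r)}.

    Assumed geometry (not taken from the paper): upper_r reaches the farthest
    sample of the class; lower_r is capped by the nearest other-class sample.
    """
    by_label = {}
    for vec, label in samples:
        by_label.setdefault(label, []).append(vec)
    regions = {}
    for label, vecs in by_label.items():
        center = [sum(col) / len(vecs) for col in zip(*vecs)]
        upper_r = max(dist(center, v) for v in vecs)
        others = [dist(center, v) for v, lab in samples if lab != label]
        nearest_other = min(others) if others else math.inf
        regions[label] = (center, min(upper_r, nearest_other), upper_r)
    return regions


def classify(query, samples, regions, k=3):
    """Return (label, number of training samples compared against)."""
    d = {lab: dist(query, reg[0]) for lab, reg in regions.items()}
    in_upper = [lab for lab in regions if d[lab] <= regions[lab][2]]
    in_lower = [lab for lab in in_upper if d[lab] < regions[lab][1]]
    # Certain case: inside one lower approximation and no other upper one.
    if len(in_lower) == 1 and len(in_upper) == 1:
        return in_lower[0], 0
    # Otherwise run KNN, restricted to candidate classes when there are any.
    candidates = set(in_upper) if in_upper else set(regions)
    pool = [(v, lab) for v, lab in samples if lab in candidates]
    nearest = sorted(pool, key=lambda s: dist(query, s[0]))[:k]
    votes = Counter(lab for _, lab in nearest)
    return votes.most_common(1)[0][0], len(pool)


# Toy data: two well-separated classes of 2-D "document vectors".
train_set = [([0.0, 0.0], "sports"), ([0.0, 1.0], "sports"), ([1.0, 0.0], "sports"),
             ([5.0, 5.0], "finance"), ([5.0, 6.0], "finance"), ([6.0, 5.0], "finance")]
regions = train(train_set)
print(classify([0.3, 0.3], train_set, regions))  # ('sports', 0): labeled directly
print(classify([4.0, 4.0], train_set, regions))  # ('finance', 6): full KNN fallback
```

In this sketch the saving comes from the first branch, where a query is labeled after one distance computation per class instead of one per training sample. A query that falls in the upper regions of several classes is compared only against the samples of those classes.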

CLC Number: TP391

SUN Rong-Zong, MIAO Duo-Qian, WEI Zhi-Hua, LI Wen. Fast KNN Algorithm for Text Classification Based on Rough Set[J]. Computer Engineering, 2010, 36(24): 175-177.

http://www.ecice06.com/CN/Y2010/V36/I24/175
