Computer Engineering ›› 2021, Vol. 47 ›› Issue (8): 93-99,108. doi: 10.19678/j.issn.1000-3428.0058692

• Artificial Intelligence and Pattern Recognition •

Linear Regression Text Classification Based on Class-wise Nearest Neighbor Dictionary

WU Jiao1, HONG Caifeng1, GU Yongchun1, GU Xingquan2, JIN Shiju1   

  1. College of Sciences, China Jiliang University, Hangzhou 310018, China;
    2. College of Standardization, China Jiliang University, Hangzhou 310018, China
  • Received: 2020-06-22  Revised: 2020-08-12  Published: 2020-08-20

  • About the authors: WU Jiao (born 1976), female, associate professor, Ph.D.; her main research interests include machine learning, text mining, and compressed sensing. HONG Caifeng and GU Yongchun are master's students; GU Xingquan is an associate professor; JIN Shiju is a lecturer.
  • Funding:
    National Natural Science Foundation of China (61302190).

Abstract: In text classification, the high dimensionality of text representations increases computational complexity. To address this problem, a Linear Regression Classification(LRC) model based on class-wise nearest neighbor dictionaries is constructed. The K-Nearest Neighbor(KNN) method is used to build a nearest neighbor dictionary for each class, and two LRC algorithms are proposed according to how the test sample is represented: one based on the concatenated class-wise nearest neighbor dictionaries and one based on the individual class-wise nearest neighbor dictionaries. In addition, the correlation between the test sample and each class is measured to clip noise classes, alleviating their impact on classification performance. The experimental results show that the proposed model achieves high classification accuracy and computational efficiency on both long and short texts, and the noise class clipping strategy also gives it good classification performance on corpora containing many classes.
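To make the class-wise nearest neighbor LRC idea concrete, the following Python sketch illustrates one plausible reading of the approach described in the abstract; it is a minimal illustration, not the authors' implementation. It assumes dense document vectors (e.g., TF-IDF rows), Euclidean distance for the KNN step, and a hypothetical function name and parameter k; the concatenated-dictionary variant and the noise class clipping step are omitted.

import numpy as np

def classwise_knn_lrc(X_train, y_train, x_test, k=10):
    # X_train: (n_samples, n_features) training document vectors (e.g., TF-IDF)
    # y_train: (n_samples,) class labels; x_test: (n_features,) test document vector
    residuals = {}
    for c in np.unique(y_train):
        Xc = X_train[y_train == c]                           # training documents of class c
        dists = np.linalg.norm(Xc - x_test, axis=1)          # distances to the test document
        idx = np.argsort(dists)[:min(k, len(Xc))]            # K nearest neighbors within class c
        Dc = Xc[idx].T                                       # class-wise nearest neighbor dictionary (columns = documents)
        beta, *_ = np.linalg.lstsq(Dc, x_test, rcond=None)   # least-squares representation of the test document
        residuals[c] = np.linalg.norm(x_test - Dc @ beta)    # reconstruction residual for class c
    return min(residuals, key=residuals.get)                 # predict the class with the smallest residual

In this reading, each class dictionary is small (at most k columns), so each least-squares problem is cheap, which is consistent with the abstract's claim of reduced computational complexity relative to regression over the full training set.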

Key words: Sparse Representation Classification(SRC), K-Nearest Neighbor(KNN), dictionary learning, Linear Regression Classification(LRC), text classification

