Computer Engineering

• Artificial Intelligence and Recognition Technology

Feature Extraction Method Based on K-Sprinkling in Text Classification

LI Huifu, LU Guang, JING Weipeng

  1. (College of Information and Computer Engineering, Northeast Forestry University, Harbin 150040, China)
  • Received: 2017-04-10 Online: 2017-12-15 Published: 2017-12-15
  • About the authors: LI Huifu (born 1992), male, master's student, main research interests: data mining and text classification; LU Guang (corresponding author) and JING Weipeng, associate professors, Ph.D.
  • Supported by:
    Natural Science Foundation of Heilongjiang Province (F201201); Special Fund for Forestry Scientific Research in the Public Interest (201504307).

Abstract: Most traditional feature extraction methods focus on the effect of the category on feature words and do not adequately express the influence of individual samples on the category. This paper therefore studies the contribution of samples to their categories. Because the Sprinkling feature extraction method does not consider how much each sample contributes to its category, a feature extraction method based on K-Sprinkling is proposed. The method combines sample tightness and sample membership information and, exploiting the characteristics of Sprinkling, maps the sample weights into the semantic space through Latent Semantic Indexing (LSI) to classify the text. Experimental results show that, compared with the traditional Sprinkling method, K-Sprinkling improves the F1 score by 1.89% on balanced samples and by 3.30% on imbalanced samples, achieving better classification performance.

Key words: feature extraction, sample membership, sample tightness, Latent Semantic Indexing (LSI), contribution degree
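
The idea summarized in the abstract can be illustrated with a minimal sketch. It assumes the usual reading of Sprinkling, in which artificial class-label terms are appended to the term-document matrix before the truncated SVD used by LSI, and it stands in for K-Sprinkling by scaling those sprinkled entries with a per-sample weight. The abstract does not give the formulas for sample tightness or sample membership, so the function name `sprinkle_lsi`, the parameter `n_sprinkle`, and the weight values below are hypothetical and are not the authors' implementation.

```python
import numpy as np

def sprinkle_lsi(X, labels, n_classes, rank, n_sprinkle=1, sample_weights=None):
    """Sprinkle class terms into a terms-by-documents matrix, then apply LSI.

    sample_weights is a hypothetical per-document weight; with None, every
    sprinkled entry is 1, which corresponds to plain (unweighted) Sprinkling.
    """
    n_terms, n_docs = X.shape
    if sample_weights is None:
        sample_weights = np.ones(n_docs)

    # One block of n_sprinkle artificial "class term" rows per class.
    S = np.zeros((n_classes * n_sprinkle, n_docs))
    for j, c in enumerate(labels):
        S[c * n_sprinkle:(c + 1) * n_sprinkle, j] = sample_weights[j]

    augmented = np.vstack([X, S])

    # Rank-limited reconstruction via SVD (the LSI step).
    U, s, Vt = np.linalg.svd(augmented, full_matrices=False)
    approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]

    # Drop the sprinkled rows so only the original terms remain.
    return approx[:n_terms]

# Toy data: 4 terms, 4 documents, 2 classes (all values made up).
X = np.array([[2., 1., 0., 0.],
              [1., 2., 0., 0.],
              [0., 0., 1., 2.],
              [0., 1., 2., 1.]])
labels = [0, 0, 1, 1]
weights = np.array([1.0, 0.6, 1.0, 0.8])  # assumed sample weights

plain = sprinkle_lsi(X, labels, n_classes=2, rank=2)
weighted = sprinkle_lsi(X, labels, n_classes=2, rank=2, sample_weights=weights)
print(plain.shape, weighted.shape)
```

In this sketch a document with a smaller weight pulls its class terms less strongly into the reduced space, which is one plausible way to let a sample's contribution to its category influence the extracted features.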

CLC number: