
Computer Engineering ›› 2012, Vol. 38 ›› Issue (12): 155-157. doi: 10.3969/j.issn.1000-3428.2012.12.046

• Artificial Intelligence and Recognition Technology •

Text Feature Selection Algorithm Based on Variance

YUAN Yi, WANG Xin-fang

  1. (School of Automation & Information Engineering, Xi’an University of Technology, Xi’an 710048, China)
  • Received:2011-07-15 Online:2012-06-20 Published:2012-06-20
  • About the authors: YUAN Yi (born 1986), female, master's student; main research interests: text categorization, pattern recognition, and large-scale system theory. WANG Xin-fang, professor.

Abstract: Traditional feature selection algorithms for Chinese text categorization perform poorly when the feature dimension is low. To address this, an evaluation function built on the idea of variance is proposed. It selects terms that carry strong class information, which improves classification accuracy on rare classes while keeping overall categorization performance from degrading. The new method, the Variance Feature Selection Method (VFSM), is compared with nine commonly used feature selection algorithms, including Document Frequency (DF), Information Gain (IG), Mutual Information (MI), and Chi-square Statistics (CHI), using a centroid-vector (Rocchio) classifier for evaluation. Experimental results on the TanCorpV1.0 corpus show that VFSM has a clear advantage in low-dimensional feature spaces and achieves a considerably higher macro-averaged F1 than each of the traditional methods.

Key words: text categorization, feature selection, variance, class information, macro-average
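
The abstract reports results as macro-averaged F1, a measure that weights every class equally and therefore reflects performance on rare classes. The following is a minimal sketch of that measure next to micro-averaged F1, assuming single-label classification. The function names and the toy labels are hypothetical illustrations; they are not the paper's evaluation code or data.

```python
from collections import Counter


def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label that occurs."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative
    scores = {}
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores[c] = 2 * tp[c] / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1: each class counts the same."""
    scores = per_class_f1(y_true, y_pred)
    return sum(scores.values()) / len(scores)


def micro_f1(y_true, y_pred):
    """With exactly one label per document, micro F1 equals accuracy."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data: nine documents of a common class, one of a rare class,
# and a classifier that always predicts the common class.
y_true = ["common"] * 9 + ["rare"]
y_pred = ["common"] * 10
print(micro_f1(y_true, y_pred), macro_f1(y_true, y_pred))
```

On this toy data the micro average stays high because the common class dominates, while the macro average is pulled down by the rare class that is never predicted. This is why an improvement on rare classes shows up mainly in the macro average.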

CLC Number: