Computer Engineering ›› 2010, Vol. 36 ›› Issue (17): 33-35,38. doi: 10.3969/j.issn.1000-3428.2010.17.012

• Software Technology and Database •

Feature Selection Method Based on Document Frequency

YANG Kai-feng, ZHANG Yi-kun, LI Yan

  1. (School of Computer Science and Engineering, Xi’an University of Technology, Xi’an 710048)
  • Online: 2010-09-05 Published: 2010-09-02
  • About the authors: YANG Kai-feng (born 1971), male, lecturer, main research interest: data mining; ZHANG Yi-kun, professor; LI Yan, lecturer
  • Supported by:
    Natural Science Foundation of Shaanxi Province (2009jm8003-1)

Abstract: When selecting features, the traditional Document Frequency (DF) method considers only the DF of a term within a category and ignores the Term Frequency (TF) of that term within each document. To address this problem, a new feature selection algorithm is proposed that combines the TF of a term in each document with its DF in the category, and the Support Vector Machine (SVM) method is used to train the classifier. Experimental results show that accounting for the contribution of high-TF terms to a category during feature selection improves the classification performance of the traditional DF method.

Key words: text classification, feature selection, Document Frequency (DF), Term Frequency (TF), Support Vector Machine (SVM)
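
The contrast drawn in the abstract between plain DF and a TF-aware variant can be illustrated with a minimal sketch. The abstract does not give the paper's scoring formula, so the rule used here (a document counts toward a term only if the term occurs at least `min_tf` times in it) is an assumption chosen for illustration, not the authors' algorithm. The function names, the `min_tf` parameter, and the toy documents are likewise hypothetical, and SVM training is omitted.

```python
from collections import Counter


def document_frequency(docs):
    """Plain DF: number of documents in which each term occurs at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tf_aware_document_frequency(docs, min_tf=2):
    """Hypothetical TF-aware DF (an assumption, not the paper's formula):
    a document counts toward a term only if the term occurs
    at least min_tf times in that document."""
    score = Counter()
    for tokens in docs:
        tf = Counter(tokens)
        score.update(term for term, n in tf.items() if n >= min_tf)
    return score


def select_features(docs, k, scorer):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    scores = scorer(docs)
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Toy tokenized documents from a single category (made-up data).
category_docs = [
    "goal goal match team".split(),
    "goal team team coach".split(),
    "match report goal goal goal".split(),
]

print(select_features(category_docs, 2, document_frequency))
print(select_features(category_docs, 2, tf_aware_document_frequency))
```

On this toy data the two scorers agree on the top term but differ on the second one, because "team" is repeated within a document while "match" only appears once per document. In a full pipeline the selected terms would define the feature space passed to an SVM classifier.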