Abstract: The traditional Document Frequency (DF) method selects features using only the number of documents in a category that contain a term; it ignores the Term Frequency (TF), that is, how many times the term occurs in each document. To address this problem, a new feature selection algorithm is proposed that combines the TF of a term in each document with its DF in the category, and the Support Vector Machine (SVM) method is used to train the classifier. Experimental results show that accounting for the contribution of high-TF terms to a category during feature selection improves the classification performance of the traditional DF method.
Key words:
text classification,
feature selection,
Document Frequency (DF),
Term Frequency (TF),
Support Vector Machine (SVM)
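The abstract does not give the scoring formula, so the following is only a minimal sketch of the general idea, not the authors' algorithm. It compares plain per-category DF with a hypothetical TF-aware variant in which each document containing a term contributes `1 + log(tf)` instead of `1`. The function names, the weighting, and the toy corpus are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def df_scores(docs, labels):
    """Plain DF: number of documents in each category that contain the term."""
    scores = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            scores[label][term] += 1
    return scores


def tf_aware_df_scores(docs, labels):
    """Hypothetical TF-aware DF (an assumption, not the paper's formula).

    Each document containing the term adds 1 + log(tf), so a term that
    occurs once counts exactly as in plain DF, and repeated terms count more.
    """
    scores = defaultdict(lambda: defaultdict(float))
    for tokens, label in zip(docs, labels):
        for term, tf in Counter(tokens).items():
            scores[label][term] += 1.0 + math.log(tf)
    return scores


def select_features(scores, k):
    """Keep the k highest-scoring terms per category and return their union."""
    selected = set()
    for term_scores in scores.values():
        # Sort by descending score, breaking ties alphabetically.
        ranked = sorted(term_scores.items(), key=lambda kv: (-kv[1], kv[0]))
        selected.update(term for term, _ in ranked[:k])
    return selected


# Toy corpus (invented for illustration only).
docs = [
    "goal goal goal match team".split(),
    "team coach match".split(),
    "stock market stock stock price".split(),
    "market price team".split(),
]
labels = ["sports", "sports", "finance", "finance"]

plain = df_scores(docs, labels)
aware = tf_aware_df_scores(docs, labels)

print(select_features(plain, 1))
print(select_features(aware, 1))
```

In this toy corpus, plain DF ranks "goal" and "stock" below terms that merely appear in more documents, while the TF-aware score promotes them because they occur repeatedly within a document. The selected terms would then define the feature space on which an SVM classifier is trained, as the abstract describes.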
YANG Kai-Feng, ZHANG Yi-Kun, LI Yan. Feature Selection Method Based on Document Frequency[J]. Computer Engineering, 2010, 36(17): 33-35, 38.