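Abstract: Centroid-based text classification methods are sensitive to the model and achieve relatively poor classification performance. To address this, a feature-selection-based method for constructing class centroid vectors, FSCC, is proposed. The feature selection value between each feature and each class is computed, the class centroid vectors are obtained from a centroid feature weight formula, and a non-normalized cosine similarity is used to measure the distance between a document and a centroid, thereby classifying the text. Experimental results show that FSCC classifies better than centroid-based methods and the support vector machine method.
Keywords:
feature selection,
feature weight,
cosine similarity,
centroid,
text classification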
Abstract: Centroid-based text categorization methods show poor performance. This paper proposes FSCC, a centroid vector construction method based on feature selection. The feature selection value between each feature and each category is computed, and the centroid vector of each category is then calculated with a centroid feature weight formula. Finally, a non-normalized cosine similarity measure is employed to calculate the similarity score between a text vector and a centroid. Experimental results show that FSCC significantly outperforms traditional centroid-based methods and the state-of-the-art Support Vector Machine (SVM).
Key words:
feature selection,
feature weight,
cosine similarity,
centroid,
text classification
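
The pipeline summarized in the abstract (feature selection values, centroid weights, non-normalized cosine similarity) can be illustrated with a minimal sketch. The abstract does not state which feature selection measure or which centroid weight formula FSCC uses, so everything concrete here is an assumption: the chi-square statistic stands in for the feature selection value, the centroid weight is simply that score (set to zero when the term is not positively associated with the class), and "non-normalized" is read as normalizing the document vector but not the centroid. The function names `feature_scores`, `build_centroids`, and `classify` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def feature_scores(docs, labels):
    """Chi-square score per (term, label), using binary term presence.

    Assumption: chi-square stands in for the paper's feature selection value.
    Scores are kept only for terms positively associated with the label.
    """
    n = len(docs)
    label_count = Counter(labels)
    term_count = Counter()
    joint = Counter()
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            term_count[term] += 1
            joint[(term, label)] += 1

    scores = {}
    for term, t_total in term_count.items():
        for label, c_total in label_count.items():
            a = joint[(term, label)]   # docs in class containing the term
            b = t_total - a            # docs outside class containing the term
            c = c_total - a            # docs in class lacking the term
            d = n - a - b - c          # docs outside class lacking the term
            denom = (a + b) * (c + d) * (a + c) * (b + d)
            if denom and a * d > b * c:
                scores[(term, label)] = n * (a * d - b * c) ** 2 / denom
            else:
                scores[(term, label)] = 0.0
    return scores


def build_centroids(docs, labels):
    """One sparse centroid per class; weights are the raw scores (assumption)."""
    centroids = defaultdict(dict)
    for (term, label), score in feature_scores(docs, labels).items():
        if score > 0.0:
            centroids[label][term] = score
    return dict(centroids)


def classify(tokens, centroids):
    """Score each class by a cosine-like similarity with an unnormalized centroid."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values()))
    if norm == 0.0:
        raise ValueError("empty document")
    best_label, best_score = None, -1.0
    for label, centroid in centroids.items():
        # Only the document vector is length-normalized; the centroid is not.
        score = sum(v / norm * centroid.get(t, 0.0) for t, v in tf.items())
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy training set, invented purely for illustration.
train_docs = [
    ["ball", "goal", "team"],
    ["goal", "match", "team"],
    ["stock", "market", "bank"],
    ["bank", "market", "rate"],
]
train_labels = ["sport", "sport", "finance", "finance"]
toy_centroids = build_centroids(train_docs, train_labels)
```

With this toy data, `classify(["goal", "team", "rate"], toy_centroids)` returns `"sport"`, because the two sport terms carry larger centroid weights than the single finance term. Leaving the centroid unnormalized lets classes with strongly discriminative terms keep that advantage in the score, which is one plausible reading of the design choice named in the abstract.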
谢华, 王健, 林鸿飞, 杨志豪. 基于特征选择的质心向量构建方法[J]. 计算机工程, 2012, 38(01): 195-196,210.
XIE Hua, WANG Jian, LIN Hong-Fei, YANG Zhi-Hao. Centroid Vector Construction Method Based on Feature Selection[J]. Computer Engineering, 2012, 38(01): 195-196,210.