Abstract: This paper proposes a comprehensive feature selection algorithm based on category discrimination degree and correlation analysis. The algorithm uses category discrimination degree to extract feature words with strong category-discriminating ability, which reduces the sparsity of the feature space. It employs correlation analysis to measure both the relevance between features and categories and the redundancy among features, so that the selected feature words are representative of their categories and are not redundant with one another. Experimental results show that the proposed algorithm effectively improves classifier performance.
Key words: text categorization, feature selection, correlation analysis, category discrimination degree, relevant independence degree
CLC number:
CHEN Jian-Hua, WANG Zhi-He, JIANG Yun. Comprehensive Feature Selection Based on Category Discrimination Degree and Correlation Analysis[J]. Computer Engineering, 2012, 38(9): 186-188, 192.