
Computer Engineering

• Artificial Intelligence and Recognition Technology •

A Method of CHI-square Feature Selection Based on Probability

ZHANG Huiyi, XIE Yeming, YUAN Zhixiang, SUN Guohua

  1. (School of Computer Science and Technology, Anhui University of Technology, Maanshan, Anhui 243032, China)
  • Received: 2015-07-27 Online: 2016-08-15 Published: 2016-08-15
  • About the authors: ZHANG Huiyi (born 1963), male, professor, main research interest: machine learning; XIE Yeming, master's student; YUAN Zhixiang, associate professor; SUN Guohua, lecturer.
  • Funding:
    Supported by the National Key Technology R&D Program of China, project "Development of an Information Integration Platform for Energy Conservation and Emission Reduction Monitoring and Control Technology" (2012BAK30B04-02).

Abstract: On imbalanced datasets, the traditional CHI-square feature selection method does not take into account the number of categories in which a word appears, the frequency of the word, or its inter-class and intra-class distribution, so it fails to select effective feature words for different categories. To solve this problem, a CHI-square feature selection method based on probability is proposed. Word probability and document probability are used to measure how frequently words and documents occur, and from them four factors are computed: the category frequency factor, the inter-class concentration factor of words, the intra-class evenness factor of words, and the inter-class concentration factor of documents. The CHI-square value is corrected with these factors, and a further factor measuring how much the same word differs across categories is applied so that the improved CHI-square selects more effective feature words. Text classification experiments show that, compared with the method before improvement, the proposed method raises the macro-averaged F1 value to some extent and achieves better classification performance on imbalanced datasets.

Key words: text categorization, CHI-square statistic (CHI), feature selection, imbalanced dataset, probability method
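
For orientation, the baseline that the abstract calls the traditional CHI-square method can be sketched as follows. This is a minimal sketch, assuming the standard 2x2 contingency form of the statistic computed from document frequencies; the function names and the toy corpus are hypothetical and do not come from the paper. The abstract does not give formulas for the correction factors, so none of them is implemented here.

```python
from collections import Counter


def chi_square(a: int, b: int, c: int, d: int) -> float:
    """Classical chi-square score of a term for one category.

    a: documents in the category that contain the term
    b: documents outside the category that contain the term
    c: documents in the category that lack the term
    d: documents outside the category that lack the term
    """
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        # Degenerate table (e.g. an empty category): no evidence either way.
        return 0.0
    return n * (a * d - b * c) ** 2 / denom


def chi_scores(docs, labels, category):
    """Score every term for `category`; each document is a list of tokens."""
    n_docs = len(docs)
    n_cat = sum(1 for y in labels if y == category)
    df_all = Counter()  # document frequency over the whole corpus
    df_cat = Counter()  # document frequency inside the category
    for tokens, y in zip(docs, labels):
        for term in set(tokens):  # count each term once per document
            df_all[term] += 1
            if y == category:
                df_cat[term] += 1
    scores = {}
    for term, df in df_all.items():
        a = df_cat[term]
        b = df - a
        c = n_cat - a
        d = n_docs - n_cat - b
        scores[term] = chi_square(a, b, c, d)
    return scores


# Hypothetical toy corpus, for illustration only.
docs = [["steel", "price"], ["steel", "plant"],
        ["goal", "match"], ["goal", "price"]]
labels = ["industry", "industry", "sport", "sport"]
scores = chi_scores(docs, labels, "industry")
top_terms = sorted(scores, key=scores.get, reverse=True)
```

Because this baseline score uses only document counts, it ignores how often a word occurs inside a document and how evenly it is spread within a category, which are the gaps the abstract says the proposed factors are meant to address.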

CLC number: