Computer Engineering ›› 2010, Vol. 36 ›› Issue (22): 184-186. doi: 10.3969/j.issn.1000-3428.2010.22.066

• Artificial Intelligence and Recognition Technology •

Feature Selection Method Based on Association Analysis for Text Classification

ZHANG Biao1,2, LIU Gui-quan1,2

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230027, China
  2. Anhui Province Key Laboratory for Computing and Communication Software, Hefei 230027, China
  • Online:2010-11-20 Published:2010-11-18
  • About the authors: ZHANG Biao (born 1981), male, master's student, research interests: machine learning and data mining; LIU Gui-quan, associate professor, Ph.D.

Abstract: This paper proposes a feature selection algorithm that takes the relationships between features into account. It mines the associations between feature words to find word groups (two-word sets) that have a significant impact on classification. The individual words in these groups are likely to be discarded by conventional feature selection algorithms, which score each word separately, because their individual scores are too low. The algorithm is compared with widely used feature selection approaches such as Information Gain (IG) and CHI on the Reuters-21578 and 20 Newsgroups text datasets. Experimental results indicate that the proposed method is a distinctive and effective feature selection method.

Key words: feature selection, cross-entropy, text classification, association mining
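
The idea summarized in the abstract, that a pair of words can be far more informative than either word scored alone, can be illustrated with a minimal sketch. This is not the paper's algorithm: the scoring function the authors use is not given on this page, so the sketch substitutes the standard CHI statistic on a 2x2 contingency table, and the toy corpus, the function names `chi_square` and `top_pairs`, and the `min_support` threshold are all hypothetical.

```python
from itertools import combinations

# Toy corpus (hypothetical): each document is a set of words with a class label.
DOCS = [
    {"apple", "pie", "recipe"},
    {"apple", "pie", "oven"},
    {"apple", "phone"},
    {"apple", "laptop"},
    {"pie", "chart"},
    {"pie", "graph"},
]
LABELS = ["food", "food", "tech", "tech", "tech", "tech"]


def chi_square(docs, labels, feature, target):
    """CHI statistic of a feature (a set of words that must all occur)
    against membership in the target class, from a 2x2 contingency table."""
    a = b = c = d = 0
    for words, label in zip(docs, labels):
        present = feature <= words      # every word of the feature occurs
        in_class = label == target
        if present and in_class:
            a += 1
        elif present:
            b += 1
        elif in_class:
            c += 1
        else:
            d += 1
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


def top_pairs(docs, labels, target, min_support=2):
    """Score word pairs that co-occur in at least min_support documents.
    Stand-in for association mining; the support threshold is an assumption."""
    counts = {}
    for words in docs:
        for pair in combinations(sorted(words), 2):
            counts[pair] = counts.get(pair, 0) + 1
    scored = [
        (chi_square(docs, labels, frozenset(pair), target), pair)
        for pair, support in counts.items()
        if support >= min_support
    ]
    return sorted(scored, reverse=True)


if __name__ == "__main__":
    for word in ("apple", "pie", "recipe"):
        print(word, chi_square(DOCS, LABELS, frozenset({word}), "food"))
    print(top_pairs(DOCS, LABELS, "food"))
```

On this toy corpus, "apple" and "pie" each score 1.5 for the class "food", below the rarer word "recipe" at 2.4, while the pair ("apple", "pie") scores 6.0. A selector that ranks single words only could drop both words, whereas a pair-aware step would keep them.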

CLC Number: