
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Optimization Mutual Information Text Feature Selection Method Based on Word Frequency

LIU Hai-feng, YAO Ze-qing, SU Zhan

  1. (Institute of Sciences, PLA University of Science and Technology, Nanjing 210007, China)
  • Received: 2013-06-13 Online: 2014-07-15 Published: 2014-07-14
  • About the authors: LIU Hai-feng (born 1962), male, professor, Ph.D.; main research interests: pattern recognition and text mining. YAO Ze-qing, professor. SU Zhan, lecturer, M.S.
  • Supported by:
    National Natural Science Foundation of China (71071161, 61273209); Natural Science Foundation of Jiangsu Province (BK2012511).

Abstract: Mutual Information (MI) is a commonly used text feature selection method. The classical MI method considers neither the differences in a feature term's frequency across categories nor the differences in its distribution among the documents within a single category. To address these shortcomings, this paper takes term frequency as its basis and examines a term's within-class distribution, its between-class distribution, and its distribution across the different documents of one class. By introducing a within-class frequency factor, a within-class position distribution factor, and a between-class distribution factor, it proposes an improved MI text feature selection method in which term frequency information is used effectively in the MI model. Text classification experiments show that the improved method raises average precision and recall by about 5.2% and 4.6%, respectively, and the average F1 value by about 4.9%.

Key words: text classification, feature selection, Mutual Information (MI), feature frequency, feature dimension reduction, within-class distribution
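
The limitation of classical MI noted in the abstract can be illustrated with a minimal sketch. It assumes the standard document-level estimate MI(t, c) = log(P(t, c) / (P(t) P(c))), computed from document counts only; it does not implement the paper's improved factors, which the abstract does not specify. The toy corpus and the names `mutual_information` and `class_term_frequency` are hypothetical.

```python
import math

def mutual_information(docs, labels, term, category):
    """Classical MI from document counts: log(P(t, c) / (P(t) * P(c)))."""
    n = len(docs)
    # Documents of the category that contain the term at least once.
    n_tc = sum(1 for d, y in zip(docs, labels) if y == category and term in d)
    n_t = sum(1 for d in docs if term in d)          # documents containing the term
    n_c = sum(1 for y in labels if y == category)    # documents in the category
    if n_tc == 0:
        return float("-inf")
    return math.log((n_tc * n) / (n_t * n_c))

def class_term_frequency(docs, labels, term, category):
    """Total number of occurrences of the term inside one category."""
    return sum(d.count(term) for d, y in zip(docs, labels) if y == category)

# Hypothetical toy corpus: each document is a list of tokens.
docs = [
    ["goal", "goal", "goal", "goal", "goal", "match"],
    ["goal", "match", "team"],
    ["stock", "market", "team"],
    ["stock", "bank"],
]
labels = ["sports", "sports", "finance", "finance"]

mi_goal = mutual_information(docs, labels, "goal", "sports")
mi_match = mutual_information(docs, labels, "match", "sports")
tf_goal = class_term_frequency(docs, labels, "goal", "sports")
tf_match = class_term_frequency(docs, labels, "match", "sports")

print(mi_goal, mi_match)   # identical scores
print(tf_goal, tf_match)   # different within-class frequencies
```

In this toy corpus both terms appear in the same two documents, so the count-based MI scores coincide even though "goal" occurs six times in the class and "match" only twice. This is the kind of frequency information that the abstract says the classical model leaves unused.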

CLC number: