
Computer Engineering ›› 2011, Vol. 37 ›› Issue (01): 16-18,21. doi: 10.3969/j.issn.1000-3428.2011.01.006

• Doctoral Theses •

Improved Algorithm of Text Feature Weighting Based on Information Gain

LI Kai-qi 1,2, DIAO Xing-chun 2, CAO Jian-jun 2

  (1. Institute of Command Automation, PLA Univ. of Sci. & Tech., Nanjing 210007, China; 2. The 63rd Research Institute, PLA General Staff Headquarters, Nanjing 210007, China)
  • Online: 2011-01-05 Published: 2010-12-31
  • About the authors: LI Kai-qi (born 1982), male, Ph.D. candidate, main research interests: personal information management, text classification, artificial intelligence; DIAO Xing-chun, research fellow and doctoral supervisor; CAO Jian-jun, postdoctoral researcher
  • Supported by:
    China Postdoctoral Science Foundation (20090461425); Jiangsu Planned Projects for Postdoctoral Research Funds (0901014B)

Abstract: In the traditional tf.idf algorithm, the idf function evaluates only at a macroscopic level how well a feature discriminates between documents. It cannot reflect how differences in a feature's distribution proportions across the individual documents and across the classes of the training set affect the computed feature weight, which reduces the accuracy of text representation. To address this problem, this paper proposes an improved feature weighting method called tf.igt.igC. Starting from an analysis of feature distribution, the method introduces the concept of information gain from information theory to take both of these distribution dimensions into account, and so overcomes the shortcomings of the traditional formula. Experimental results on two open source corpora show that, compared with two other feature weighting methods, tf.idf.ig and tf.idf.igc, tf.igt.igC is more effective in computing feature weights.

Key words: feature distribution, feature weighting, text classification

CLC number:
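
The limitation of idf described in the abstract can be illustrated with a small sketch. The code below uses the textbook definitions of idf and of the information gain of a term with respect to the class labels; it is not the paper's tf.igt.igC formula, which the abstract does not state. The toy corpus, the function names, and the choice of the natural logarithm for idf are all assumptions made for illustration.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a distribution given as raw counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def idf(term, docs):
    """Textbook idf: log(N / df). Ignores how the term spreads over classes."""
    df = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(len(docs) / df) if df else 0.0


def information_gain(term, docs):
    """IG(term) = H(C) - [P(t) H(C|t) + P(not t) H(C|not t)]."""
    labels = sorted({label for _, label in docs})
    with_t = Counter(label for tokens, label in docs if term in tokens)
    without_t = Counter(label for tokens, label in docs if term not in tokens)
    n = len(docs)
    n_with = sum(with_t.values())
    n_without = n - n_with
    h_class = entropy([with_t[l] + without_t[l] for l in labels])
    h_cond = (n_with / n) * entropy([with_t[l] for l in labels]) \
        + (n_without / n) * entropy([without_t[l] for l in labels])
    return h_class - h_cond


# Hypothetical toy training set: (tokens, class label).
docs = [
    (["ball", "team", "win"], "sport"),
    (["ball", "goal", "the"], "sport"),
    (["vote", "party", "the"], "politics"),
    (["vote", "law", "team"], "politics"),
]

for term in ("ball", "team"):
    print(term, round(idf(term, docs), 3), round(information_gain(term, docs), 3))
```

In this toy corpus "ball" and "team" each occur in two of the four documents, so idf gives them the same score. "ball" occurs only in the sport class while "team" is split evenly between both classes, and information gain separates them (1 bit versus 0 bits). This is the kind of class-distribution difference that the abstract says idf alone cannot capture.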