
Computer Engineering ›› 2012, Vol. 38 ›› Issue (08): 37-40. doi: 10.3969/j.issn.1000-3428.2012.08.013

• Software Technology and Database •

TFIDF Algorithm Based on Information Gain and Information Entropy

LI Xue-ming, LI Hai-rui, XUE Liang, HE Guang-jun

  1. (College of Computer Science, Chongqing University, Chongqing 400044, China)
  • Received: 2011-07-11 Online: 2012-04-20 Published: 2012-04-20
  • About the authors: LI Xue-ming (born 1967), male, associate professor, main research interests: data mining, grid computing; LI Hai-rui, XUE Liang, and HE Guang-jun, master's students
  • Supported by:
    Fundamental Research Funds for the Central Universities (CDJXS111 80009)

Abstract: The classical Term Frequency-Inverse Document Frequency (TFIDF) term-weighting algorithm ignores how the distribution of a feature term within and between categories affects its weight. To address this problem, this paper introduces the concept of information entropy to improve the information-gain-based TFIDF algorithm (TFIDFIG) and proposes a TFIDF algorithm based on information gain and information entropy (TFIDFIGE). Experimental results show that TFIDFIGE achieves higher precision and recall than both the traditional TFIDF algorithm and the TFIDFIG algorithm.

Key words: text classification, information gain, information entropy, Term Frequency-Inverse Document Frequency (TFIDF) algorithm
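
As a rough illustration of the two ingredients named in the abstract, the Python sketch below computes a classic TF-IDF weight and the information gain of a term on a toy labelled corpus. The corpus, the function names (`tf_idf`, `information_gain`), and the use of binary term presence for information gain are assumptions made here for illustration. The TFIDFIG and TFIDFIGE weighting formulas themselves are not given on this page and are not reproduced.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """Classic TF-IDF: relative term frequency times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def class_entropy(labels):
    """Shannon entropy (in bits) of the class distribution in a list of labels."""
    if not labels:
        return 0.0
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - [P(t) H(C | t present) + P(not t) H(C | t absent)]."""
    n = len(docs)
    with_t = [lab for d, lab in zip(docs, labels) if term in d]
    without_t = [lab for d, lab in zip(docs, labels) if term not in d]
    conditional = (len(with_t) / n) * class_entropy(with_t) \
                + (len(without_t) / n) * class_entropy(without_t)
    return class_entropy(labels) - conditional

# Toy corpus (hypothetical): two "sports" and two "finance" documents.
docs = [
    ["ball", "goal", "team"],
    ["team", "match", "ball"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sports", "sports", "finance", "finance"]

for term in ("ball", "team"):
    print(term, tf_idf(term, docs[0], docs), information_gain(term, docs, labels))
```

On this toy data, `tf_idf` never sees the labels, so its output cannot reflect how a term is spread across categories, whereas `information_gain` is higher for "ball" (found in only one class) than for "team" (found in both).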

CLC number: