摘要: 传统的特征词权重算法TFIDF忽略了特征词在类内、类间的分布对其权重的影响。针对该问题,引入信息熵的概念,对基于信息增益的TFIDF算法(TFIDFIG)进行改进,提出一种基于信息增益与信息熵的TFIDF算法(TFIDFIGE)。实验结果表明,与传统的TFIDF算法和TFIDFIG算法相比,TFIDFIGE算法的查准率和查全率较高。
关键词:
文本分类,
信息增益,
信息熵,
TFIDF算法
Abstract: The classical Term Frequency and Inverse Document Frequency (TFIDF) term-weighting algorithm ignores how the distribution of a term within and between categories affects its weight. To address this problem, this paper introduces the concept of information entropy to improve the TFIDF algorithm based on information gain (TFIDFIG), and proposes a TFIDF algorithm based on information gain and information entropy (TFIDFIGE). Experimental results show that the TFIDFIGE algorithm achieves higher precision and recall than both the traditional TFIDF algorithm and the TFIDFIG algorithm.
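The three quantities named in the abstract can be illustrated with a minimal Python sketch. It shows only the textbook definitions of TF-IDF, information gain, and Shannon entropy; the abstract does not state how TFIDFIGE combines them, so no combined weight is computed here. The function names, the toy corpus, and the choice of a term's occurrence counts per class as the entropy input are all assumptions made for illustration.

```python
import math
from collections import Counter

def tf_idf(term, doc, docs):
    """Classical TF-IDF: relative term frequency in doc times log(N / document frequency)."""
    tf = doc.count(term) / len(doc)
    df = sum(1 for d in docs if term in d)
    return tf * math.log(len(docs) / df) if df else 0.0

def entropy(probs):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def class_entropy(labels):
    """Entropy of the class distribution of a list of labels."""
    counts = Counter(labels)
    return entropy(c / len(labels) for c in counts.values())

def information_gain(term, docs, labels):
    """IG(t) = H(C) - P(t) H(C | t) - P(not t) H(C | not t)."""
    with_t = [y for d, y in zip(docs, labels) if term in d]
    without_t = [y for d, y in zip(docs, labels) if term not in d]
    gain = class_entropy(labels)
    for part in (with_t, without_t):
        if part:  # an empty partition has zero weight
            gain -= len(part) / len(docs) * class_entropy(part)
    return gain

def term_class_entropy(term, docs, labels):
    """Entropy of a term's occurrence counts across classes (assumed illustration)."""
    counts = Counter()
    for d, y in zip(docs, labels):
        counts[y] += d.count(term)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return entropy(c / total for c in counts.values())

# Hypothetical toy corpus: tokenized documents with their class labels.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "goal"],
    ["stock", "market", "team"],
    ["market", "price", "stock"],
]
labels = ["sport", "sport", "finance", "finance"]

for term in ("goal", "team"):
    print(term,
          tf_idf(term, docs[1], docs),
          information_gain(term, docs, labels),
          term_class_entropy(term, docs, labels))
```

In this toy corpus "goal" occurs only in one class while "team" is spread over both, so "goal" has the larger information gain and the smaller entropy across classes. Plain TF-IDF cannot see this kind of within-class and between-class distribution, since it looks only at document counts.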
Key words:
text classification,
information gain,
information entropy,
Term Frequency and Inverse Document Frequency (TFIDF)
李学明, 李海瑞, 薛亮, 何光军. 基于信息增益与信息熵的TFIDF算法[J]. 计算机工程, 2012, 38(08): 37-40.
LI Xue-Ming, LI Hai-Rui, XUE Liang, HE Guang-Jun. TFIDF Algorithm Based on Information Gain and Information Entropy[J]. Computer Engineering, 2012, 38(08): 37-40.