Abstract: TFIDF is a common method for weighting the feature terms of a document. The method is simple to apply, but it undervalues terms that appear frequently in the documents of a single class. Such terms are representative of the text of that class and should be assigned higher weights. This paper modifies the IDF expression in TFIDF to increase the weight of terms that occur frequently within one class. To validate the change, the improved TFIDF is used to select feature terms and a genetic algorithm is used to train the classifier. In the experiments, the method outperforms the other algorithms it is compared with, which indicates that the improved strategy is feasible.
Key words: Text classification, Feature selection, TFIDF, Class discrimination
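The abstract does not state the modified IDF expression, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' formula. It compares the standard IDF, log(N / n), with a hypothetical class-aware variant, log(N * m / (m + k)), where m is the number of documents in the target class that contain the term and k is the number of documents in other classes that contain it. The function names, the toy corpus `DOCS`, and the variant itself are assumptions made for illustration.

```python
import math

# Toy labelled corpus (hypothetical data): each entry is (tokens, class_label).
DOCS = [
    (["ball", "goal", "team"], "sports"),
    (["ball", "team", "coach"], "sports"),
    (["ball", "match", "goal"], "sports"),
    (["stock", "market", "team"], "finance"),
    (["stock", "bank", "rate"], "finance"),
    (["bank", "market", "rate"], "finance"),
]


def term_frequency(term, tokens):
    """Relative frequency of a term within one tokenised document."""
    return tokens.count(term) / len(tokens) if tokens else 0.0


def standard_idf(term, docs):
    """Classic IDF: log(N / n), n = number of documents containing the term."""
    n_total = len(docs)
    n_with = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(n_total / n_with) if n_with else 0.0


def class_aware_idf(term, docs, target_class):
    """Hypothetical class-aware IDF: log(N * m / (m + k)).

    m = documents of target_class containing the term,
    k = documents of other classes containing the term.
    This is an illustrative assumption, not the expression from the paper.
    """
    n_total = len(docs)
    m = sum(1 for tokens, label in docs if label == target_class and term in tokens)
    k = sum(1 for tokens, label in docs if label != target_class and term in tokens)
    if m == 0:
        return 0.0  # the term never occurs in the target class
    return math.log(n_total * m / (m + k))


if __name__ == "__main__":
    for term in ("ball", "team"):
        print(
            term,
            round(standard_idf(term, DOCS), 3),
            round(class_aware_idf(term, DOCS, "sports"), 3),
        )
```

In this toy corpus, "ball" and "team" each occur in three of the six documents, so the standard IDF cannot tell them apart. The class-aware variant weights "ball" (found only in the sports class) above "team" (spread across both classes), which is the behaviour the abstract argues for.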
ZHANG Yufang; PENG Shiming; LV Jia. Improvement and Application of TFIDF Method Based on Text Classification[J]. Computer Engineering, 2006, 32(19): 76-78.