
Computer Engineering, 2006, Vol. 32, Issue 19: 76-78. doi: 10.3969/j.issn.1000-3428.2006.19.028

• Software Technology and Database •

Improvement and Application of TFIDF Method Based on Text Classification

ZHANG Yufang1, PENG Shiming1, LV Jia2

  1. Department of Computer Science, Chongqing University, Chongqing 400045
  2. College of Mathematics and Computer Science, Chongqing Normal University, Chongqing 400047

  • Online: 2006-10-05 Published: 2006-10-05

Abstract: TFIDF is a common method for weighting the terms of a document. The method is simple to apply, but it undervalues terms that occur frequently in the documents of a single class. Such terms are representative of that class and should receive a higher weight. This paper modifies the IDF expression in TFIDF to increase the weight of these terms. To validate the change, the improved TFIDF is used to select feature terms and a genetic algorithm is used to train the classifier. The experiments indicate that the method outperforms the other algorithms tested and that the improved strategy is feasible.

Key words: Text classification, Feature selection, TFIDF, Class discrimination
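
The weakness described in the abstract can be illustrated with a small sketch. The abstract does not state the modified IDF expression, so the `class_aware_idf` function below is a hypothetical stand-in, not the authors' formula: it assumes an IDF of the form log(1 + N * m / (m + k)), where N is the number of documents, m is the number of documents in the target class that contain the term, and k is the number of documents outside that class that contain it. The toy corpus `DOCS` and all function names are likewise invented for illustration.

```python
import math

# Toy corpus (hypothetical): each entry is (tokens, class label).
DOCS = [
    (["goal", "match", "team"], "sports"),
    (["goal", "league", "team"], "sports"),
    (["goal", "striker", "match"], "sports"),
    (["market", "stock", "team"], "finance"),
    (["market", "bank", "rate"], "finance"),
    (["stock", "rate", "bank"], "finance"),
]


def standard_idf(term, docs):
    """Classic IDF: log(N / n_t), where n_t counts documents containing the term."""
    n_docs = len(docs)
    n_t = sum(1 for tokens, _ in docs if term in tokens)
    return math.log(n_docs / n_t) if n_t else 0.0


def class_aware_idf(term, docs, target_class):
    """Illustrative class-aware IDF (an assumption, not the paper's expression).

    m: documents of target_class containing the term
    k: documents of other classes containing the term
    The weight grows as the term concentrates in target_class.
    """
    n_docs = len(docs)
    m = sum(1 for tokens, label in docs if label == target_class and term in tokens)
    k = sum(1 for tokens, label in docs if label != target_class and term in tokens)
    if m + k == 0:
        return 0.0
    return math.log(1.0 + n_docs * m / (m + k))


if __name__ == "__main__":
    for term in ("goal", "team"):
        print(term,
              round(standard_idf(term, DOCS), 3),
              round(class_aware_idf(term, DOCS, "sports"), 3))
```

In this toy corpus "goal" and "team" each occur in three of the six documents, so the classic IDF cannot tell them apart, even though "goal" occurs only in the sports class. The class-aware variant ranks "goal" above "team" for that class, which is the kind of behaviour the abstract argues for.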
