
Computer Engineering (计算机工程) ›› 2010, Vol. 36 ›› Issue (9): 197-199. doi: 10.3969/j.issn.1000-3428.2010.09.069

• Artificial Intelligence and Recognition Technology •

Improved Feature Weighting Algorithm for Text Categorization

TAI De-yi (台德艺), WANG Jun (王俊)

  1. (Key Laboratory of Machine Vision and Intelligence Control Technology, Hefei University, Hefei 230601)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-05-05 Published: 2010-05-05

Abstract: Term Frequency/Inverse Document Frequency (TF-IDF) is a feature weighting scheme for the Vector Space Model (VSM) that is widely used in text categorization. It accounts for term frequency and inverse document frequency, but it cannot capture how a term is distributed among classes and within a class. To raise the weight of representative terms that occur frequently in one class and are evenly distributed inside it, a term distribution concentration coefficient is introduced to improve the IDF function, and the result is weighted by a dispersion coefficient, giving the TF-IIDF-DIC weighting function. Experimental results show that K-NN text categorization based on TF-IIDF-DIC raises the macro-averaged F1 value by 6.79% compared with the conventional TF-IDF algorithm.

Key words: Vector Space Model (VSM), text categorization, feature weighting, feature distribution
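
The limitation of TF-IDF described in the abstract can be illustrated with a minimal sketch of the standard baseline weighting. This is not the TF-IIDF-DIC function, whose formula is not given on this page. The function name `tf_idf`, the toy corpus, and the choice of length-normalized TF with a natural-log IDF are assumptions made only for illustration.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized, non-empty document.

    Assumed baseline: weight = (count / doc_length) * ln(N / df),
    where N is the number of documents and df the document frequency.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights

# Hypothetical toy corpus: two "sports" and two "finance" documents.
docs = [
    ["goal", "match", "report"],   # sports
    ["goal", "team", "market"],    # sports
    ["stock", "bank", "report"],   # finance
    ["stock", "market", "bank"],   # finance
]

w = tf_idf(docs)
# "goal" occurs only in sports documents, "report" once in each class,
# yet both have df = 2, so plain TF-IDF cannot tell them apart.
same_weight = math.isclose(w[0]["goal"], w[0]["report"])
```

In this corpus `same_weight` is true: the class-specific term "goal" and the class-neutral term "report" receive identical weights in the first document, because IDF only sees document counts and ignores class labels. This is the gap that the abstract's concentration and dispersion coefficients are meant to address.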

CLC Number: