Computer Engineering ›› 2008, Vol. 34 ›› Issue (9): 222-224. doi: 10.3969/j.issn.1000-3428.2008.09.080

• Artificial Intelligence and Recognition Technology •

Quantitative Analysis of Impact Factors in Text Categorization

GAO Ying-fan1, MA Run-bo2, LIU Yu-shu1

  1. School of Computer Science & Technology, Beijing Institute of Technology, Beijing 100081; 2. College of Physics and Electronics, Shanxi University, Taiyuan 030006
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-05-05 Published: 2008-05-05

Abstract: Using a category-feature database that contains all features of the training set, together with three text categorization algorithms (the distance-based Rocchio and Fast TC algorithms and the probabilistic Naive Bayes (NB) algorithm), this paper quantitatively analyzes how four factors affect categorization accuracy: stopword filtering, stemming, digits, and test document length. The experiments show that stopword filtering is a lossless means of feature compression, and that stemming lowers accuracy slightly but remains a feasible means of feature compression. The semantic association between digits and other words improves the accuracy of Rocchio and Fast TC but lowers that of NB, which treats features as mutually independent. The way the accuracy of the three algorithms changes as different numbers of keywords are taken from each test document illustrates how the beneficial and noisy information carried by features affects categorization accuracy. These common feature selection choices are therefore closely tied to categorization results.

Key words: category-feature database, impact factors, effectiveness of categorization
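
The four factors named in the abstract can be pictured as switches on a preprocessing step placed in front of a classifier. The sketch below is illustrative only and is not the authors' implementation: the stopword list, the `crude_stem` suffix stripper, the raw term-frequency weighting, and all function and parameter names are assumptions made for this example. It shows only a Rocchio-style centroid classifier with cosine similarity; Fast TC and NB are not reproduced here.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, tiny stopword list used only for illustration.
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is", "for"}


def crude_stem(token):
    """Rough suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text, remove_stopwords=True, stem=True,
               keep_digits=True, max_keywords=None):
    """Turn text into term counts, with one switch per factor in the abstract."""
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if not keep_digits:
        tokens = [t for t in tokens if not t.isdigit()]
    if stem:
        tokens = [crude_stem(t) for t in tokens]
    counts = Counter(tokens)
    if max_keywords is not None:
        # Keep only the most frequent terms to mimic a shorter test document.
        counts = Counter(dict(counts.most_common(max_keywords)))
    return counts


def centroid(vectors):
    """Average the term-count vectors of one category."""
    total = Counter()
    for vec in vectors:
        total.update(vec)
    return {term: count / len(vectors) for term, count in total.items()}


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_rocchio(labeled_texts, **options):
    """Build one centroid per category from (label, text) pairs."""
    by_label = defaultdict(list)
    for label, text in labeled_texts:
        by_label[label].append(preprocess(text, **options))
    return {label: centroid(vecs) for label, vecs in by_label.items()}


def classify(model, text, **options):
    """Return the label whose centroid is most similar to the text."""
    vec = preprocess(text, **options)
    return max(model, key=lambda label: cosine(vec, model[label]))


# Made-up toy data, not taken from the paper.
train = [
    ("sports", "The team scored 3 goals in the match"),
    ("sports", "Players training for the 2008 games"),
    ("finance", "Stocks gained 3 percent in trading"),
    ("finance", "The bank raised interest rates"),
]
model = train_rocchio(train)
predicted = classify(model, "the players scored goals")
```

Retraining and reclassifying with `keep_digits=False`, `stem=False`, or a small `max_keywords` value gives a rough way to see how each switch changes the feature set, which is the kind of comparison the abstract describes at a much larger scale.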
