
Computer Engineering ›› 2007, Vol. 33 ›› Issue (22): 45-47. doi: 10.3969/j.issn.1000-3428.2007.22.016

• Doctoral Thesis •

Research of Automatic Text Categorization Based on Sentence Category VSM

ZHANG Yun-liang (张运良)1,2, ZHANG Quan (张全)2

  1. Graduate School, Chinese Academy of Sciences, Beijing 100039; 2. Institute of Acoustics, Chinese Academy of Sciences, Beijing 100080
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-11-20 Published: 2007-11-20

Abstract: The vector space model (VSM) is a mature text representation model for automatic text categorization. Words or phrases are commonly used as feature items, but they usually carry only a small amount of local semantic information. To achieve content-based text categorization, this paper uses the sentence categories of HNC theory, which carry more semantic information, as feature items. It reduces the dimensionality of the sentence category vector space with techniques such as decomposing mixed sentence categories, computes feature weights with the tfc algorithm, and classifies with the KNN algorithm. The average precision and recall of the classifier are acceptable, and the classifier places no requirement on the abstraction level of the categories: categories of higher and lower abstraction can be classified together. Performance could be improved further with better machine learning algorithms and other HNC language understanding techniques.
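The weighting and classification steps named in the abstract can be illustrated with a minimal sketch. It assumes the standard tfc definition (term frequency times log inverse document frequency, cosine-normalized) and a similarity-weighted KNN vote; the abstract does not state the paper's exact variants, so both are assumptions. The labels `SC1` to `SC5` are hypothetical placeholders for sentence category codes, the toy documents and class names are invented for illustration, and the dimensionality-reduction step is not sketched.

```python
import math
from collections import Counter


def weigh(doc, idf):
    """tfc-weight one document; features absent from the training idf table are dropped."""
    tf = Counter(f for f in doc if f in idf)
    raw = {f: tf[f] * idf[f] for f in tf}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    if norm == 0.0:
        return {f: 0.0 for f in raw}
    return {f: w / norm for f, w in raw.items()}


def build_idf(docs):
    """idf[f] = log(N / df[f]), where df counts the documents containing feature f."""
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    return {f: math.log(len(docs) / df[f]) for f in df}


def cosine(a, b):
    """Sparse dot product; equals cosine similarity because vectors are unit length."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(f, 0.0) for f, w in a.items())


def knn_classify(query, train_vecs, train_labels, k=3):
    """Return the label with the largest similarity-weighted vote among the k nearest."""
    scored = [(cosine(query, vec), label) for vec, label in zip(train_vecs, train_labels)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter()
    for sim, label in scored[:k]:
        votes[label] += sim
    return votes.most_common(1)[0][0]


# Toy corpus: each document is the sequence of (hypothetical) sentence category
# labels assigned to its sentences.
train_docs = [
    ["SC1", "SC1", "SC2"],
    ["SC1", "SC3"],
    ["SC4", "SC4", "SC2"],
    ["SC4", "SC5"],
]
train_labels = ["A", "A", "B", "B"]

idf = build_idf(train_docs)
train_vecs = [weigh(doc, idf) for doc in train_docs]

query_vec = weigh(["SC1", "SC3", "SC2"], idf)
predicted = knn_classify(query_vec, train_vecs, train_labels, k=3)
print(predicted)
```

Because every vector is cosine-normalized at weighting time, the similarity in the KNN step reduces to a sparse dot product, and the query is weighted with the idf table learned from the training set only.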

Key words: text classification, sentence category, vector space model (VSM), HNC theory
