
Computer Engineering (计算机工程) ›› 2008, Vol. 34 ›› Issue (18): 87-88. doi: 10.3969/j.issn.1000-3428.2008.18.031

• Software Technology and Database •

Research and Implementation of Real-time Text Categorization System (实时文本分类系统的研究与实现)

HUANG Xu (黄旭), ZHU Yan-qin (朱艳琴), LUO Xi-zhao (罗喜召)

  1. (School of Computer Science and Technology, Soochow University, Suzhou 215006)
  • Online: 2008-09-20 Published: 2008-09-20

Abstract: This paper analyzes the factors that limit real-time performance in text categorization, namely the high time cost of word segmentation and the excessively high dimensionality of the feature space. Motivated by the real-time application of Web page filtering, it proposes a real-time text categorization method that weakens the word segmentation step and reduces the dimensionality of the feature space in order to increase classification speed. Classification quality is maintained by optimizing the selection of feature terms, and a real-time text categorization system is implemented based on Bayesian theory. Experimental results show that the method significantly increases classification speed while maintaining precision and recall at 85% and 94%, respectively.

Key words: information security, content security, text categorization
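
The abstract does not give the feature selection procedure or the exact Bayesian model, so the following is only a minimal sketch of the general idea under stated assumptions: a multinomial naive Bayes classifier whose features are a small, hand-picked list of terms matched as raw substrings, so that no word segmentation is needed and the feature space stays low-dimensional. The class name `KeywordNaiveBayes`, the term list, and the toy training texts are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


class KeywordNaiveBayes:
    """Multinomial naive Bayes over a small, fixed list of feature terms."""

    def __init__(self, feature_terms, alpha=1.0):
        self.feature_terms = list(feature_terms)  # assumed reduced feature set
        self.alpha = alpha                        # Laplace smoothing constant
        self.log_prior = {}
        self.log_likelihood = {}

    def extract(self, text):
        # Count raw substring occurrences of each feature term.
        # No word segmentation is performed on the input text.
        return [text.count(term) for term in self.feature_terms]

    def fit(self, texts, labels):
        class_counts = Counter(labels)
        term_counts = {c: [0] * len(self.feature_terms) for c in class_counts}
        for text, label in zip(texts, labels):
            for i, n in enumerate(self.extract(text)):
                term_counts[label][i] += n
        for c, n_docs in class_counts.items():
            self.log_prior[c] = math.log(n_docs / len(labels))
            denom = sum(term_counts[c]) + self.alpha * len(self.feature_terms)
            self.log_likelihood[c] = [
                math.log((n + self.alpha) / denom) for n in term_counts[c]
            ]
        return self

    def predict(self, text):
        counts = self.extract(text)
        scores = {
            c: self.log_prior[c]
            + sum(n * ll for n, ll in zip(counts, self.log_likelihood[c]))
            for c in self.log_prior
        }
        return max(scores, key=scores.get)


# Toy data, invented purely for illustration.
train_texts = [
    "win free prize now, free casino bonus",
    "casino jackpot, claim your free prize",
    "lecture notes on database indexing",
    "software engineering course schedule and lecture",
]
train_labels = ["block", "block", "allow", "allow"]
terms = ["free", "casino", "prize", "lecture", "database", "course"]

clf = KeywordNaiveBayes(terms).fit(train_texts, train_labels)
```

With only six feature terms, scoring a page costs six substring scans and a handful of additions per class, which illustrates why shrinking the feature set and skipping segmentation can favor speed. How well such a reduced set preserves precision and recall depends on how the terms are chosen, which this sketch does not address.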
