Computer Engineering ›› 2011, Vol. 37 ›› Issue (6): 79-81. doi: 10.3969/j.issn.1000-3428.2011.06.028

• Software Technology and Database •

Research on Automatic Retrieval and Classification for Chinese RSS Information

LI Qing-cheng, ZUO Shan-shan, DONG Zhen-hua, ZHANG Jin

  1. (College of Information Technical Science, Nankai University, Tianjin 300071, China)
  • Online: 2011-03-20 Published: 2011-03-29
  • About the authors: LI Qing-cheng (born 1964), male, professor and doctoral supervisor, main research interests: embedded systems and information security; ZUO Shan-shan, M.S.; DONG Zhen-hua, Ph.D.; ZHANG Jin, lecturer
  • Supported by:

    Special Fund for Software Industry Development of Tianjin (07FZRJFX01300

Abstract:

This paper designs and implements a vertical crawler for RSS that uses a breadth-first algorithm and focuses on RSS feeds to collect them automatically. Building on word segmentation, it improves the term-weight calculation for RSS feeds, applies word filtering, and classifies RSS content automatically with the vector space model (VSM). Experimental results show that the system retrieves and classifies Chinese RSS information with high efficiency and accuracy under low system load, and thus supports effective management of aggregated RSS information.

Key words: Really Simple Syndication (RSS), information retrieval, crawler, Chinese text classification, vector space model (VSM)
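
The two steps named in the abstract, breadth-first collection of RSS feeds and VSM classification, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give their improved weighting formula or filtering rules, so plain TF-IDF with cosine similarity against per-class vectors is swapped in here. The names `bfs_find_feeds`, `train`, `classify`, the in-memory `links` graph (a stand-in for fetching and parsing pages), and the `is_feed` predicate are all hypothetical, and the input is assumed to be already segmented into words.

```python
import math
from collections import Counter, deque


def bfs_find_feeds(links, seed, is_feed, max_pages=100):
    """Breadth-first walk over a link graph, collecting URLs flagged as feeds.

    `links` maps a URL to the URLs it links to (hypothetical stand-in for
    downloading a page); `is_feed` is a hypothetical RSS-detection predicate.
    """
    seen, queue, feeds = {seed}, deque([seed]), []
    visited = 0
    while queue and visited < max_pages:
        url = queue.popleft()  # FIFO order gives breadth-first traversal
        visited += 1
        if is_feed(url):
            feeds.append(url)
            continue  # treat a feed as a leaf and do not expand it
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return feeds


def cosine(u, v):
    """Cosine similarity between two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def train(labeled_docs):
    """Build smoothed IDF weights and one summed TF-IDF vector per class.

    `labeled_docs` is a list of (label, tokens) pairs with pre-segmented tokens.
    """
    n = len(labeled_docs)
    df = Counter(t for _, toks in labeled_docs for t in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    class_vecs = {}
    for label, toks in labeled_docs:
        vec = class_vecs.setdefault(label, Counter())
        for t, tf in Counter(toks).items():
            vec[t] += tf * idf[t]
    return idf, class_vecs


def classify(tokens, idf, class_vecs):
    """Return the class whose vector is most similar to the document vector."""
    # Terms never seen in training carry no weight and are skipped.
    vec = {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}
    return max(class_vecs, key=lambda label: cosine(vec, class_vecs[label]))


if __name__ == "__main__":
    links = {"a": ["b", "c"], "b": ["b.xml"], "c": ["c.xml", "a"]}
    print(bfs_find_feeds(links, "a", lambda u: u.endswith(".xml")))

    idf, class_vecs = train([
        ("sports", ["足球", "比赛", "球员"]),
        ("tech", ["软件", "算法", "系统"]),
    ])
    print(classify(["足球", "球员"], idf, class_vecs))
```

In this sketch the crawl is bounded by `max_pages` and the `seen` set, which keeps the traversal finite on cyclic link graphs; a real crawler would replace `links` with page fetching and link extraction.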