Computer Engineering ›› 2010, Vol. 36 ›› Issue (8): 286-288. doi: 10.3969/j.issn.1000-3428.2010.08.100

• Development Research and Design Technology •

Research of Co-occurrence Words Search-based Topic Crawler

GE Ling, JIANG Zong-li

  1. (College of Computer, Beijing University of Technology, Beijing 100124)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-04-20 Published: 2010-04-20

Abstract: This paper improves the topic model by building a co-occurrence word database, which raises the topical relevance and quality of the downloaded Web pages. The improved model can also describe the context in which keywords occur, infer user intent, and adjust the ranking of search results. On this basis, an FDC topic crawler system is designed and implemented; it uses an improved topic-sensitive FDC-PageRank algorithm to compute the priority of Web pages. Experiments show that the system performs well.

Key words: topic crawler, co-occurrence words, FDC topic model, FDC_Topic Sensitive PageRank algorithm

CLC Number:
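
The abstract names two ingredients, a co-occurrence word database and a topic-sensitive PageRank variant for page priority, but it does not define the FDC model or the FDC-PageRank formula. The sketch below is therefore only a generic illustration of how such pieces could fit together, not the authors' method. It scores text against a weighted term table and then runs ordinary topic-sensitive PageRank, in which the teleport vector is biased toward topically relevant pages. The function names, the term weights, and the averaging rule in `cooccurrence_relevance` are all assumptions made for illustration.

```python
from typing import Dict, List


def cooccurrence_relevance(text: str, term_weights: Dict[str, float]) -> float:
    """Hypothetical relevance score: mean weight of the words in `text`.

    `term_weights` stands in for a co-occurrence word database; how the
    paper actually weights and combines terms is not given in the abstract.
    """
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(term_weights.get(w, 0.0) for w in words) / len(words)


def topic_sensitive_pagerank(
    links: Dict[str, List[str]],
    topic_weight: Dict[str, float],
    damping: float = 0.85,
    iterations: int = 50,
) -> Dict[str, float]:
    """Power iteration for PageRank with a topic-biased teleport vector.

    This is the generic topic-sensitive scheme, assumed here as a stand-in
    for the paper's FDC-PageRank, whose exact form is not in the abstract.
    """
    pages = sorted(set(links) | {t for ts in links.values() for t in ts})
    total = sum(topic_weight.get(p, 0.0) for p in pages)
    if total <= 0:
        # No topical evidence: uniform teleport, i.e. plain PageRank.
        teleport = {p: 1.0 / len(pages) for p in pages}
    else:
        teleport = {p: topic_weight.get(p, 0.0) / total for p in pages}

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) * teleport[p] for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling page: spread its rank along the teleport vector.
                for p in pages:
                    new_rank[p] += damping * rank[page] * teleport[p]
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Toy link graph and page texts, invented for this example.
    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    texts = {"a": "news portal", "b": "topic crawler design",
             "c": "crawler queue", "d": "weather"}
    weights = {"crawler": 1.0, "topic": 0.5}
    relevance = {p: cooccurrence_relevance(t, weights) for p, t in texts.items()}
    ranks = topic_sensitive_pagerank(links, relevance)
    # A crawler frontier could fetch pages in descending rank order.
    for page in sorted(ranks, key=ranks.get, reverse=True):
        print(page, round(ranks[page], 4))
```

In a crawler, the resulting scores would typically order the URL frontier so that pages reachable from topically relevant pages are fetched first. Pages with no topical weight and no inbound links receive zero rank under this scheme.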