Computer Engineering, 2008, Vol. 34, Issue 7: 56-58. doi: 10.3969/j.issn.1000-3428.2008.07.019

• Software Technology and Database •

Deep Web Sources Focused Crawler

LIN Chao (林超), ZHAO Peng-peng (赵朋朋), CUI Zhi-ming (崔志明)

  1. (Institute of Intelligent Information Processing and Application, Suzhou University, Suzhou 215006)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-04-05 Published: 2008-04-05

Abstract: A large number of pages on the Internet are generated dynamically by back-end databases and cannot be reached by traditional search engines; these pages are known as the Deep Web. Data source discovery is a key step in the large-scale integration of Deep Web data sources. This paper proposes a focused crawling algorithm for Deep Web data sources. When evaluating the importance of a hyperlink, the algorithm considers both the relevance of the page to the topic and link-related information. Experiments show that the method is effective.

Key words: Deep Web sources, focused crawler, Bayes classifier

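The abstract says that link importance combines the relevance of the page to the topic with link-related information, and the key words mention a Bayes classifier, but the scoring formula itself is not given here. The following is only a minimal sketch of how such a score might be computed, not the authors' method. It assumes a two-class naive Bayes model for page relevance and a simple linear blend with keyword overlap in the anchor text and URL. All names (`NaiveBayes`, `link_score`, `Frontier`), the weight `alpha`, the training snippets, and the example URLs are hypothetical.

```python
import heapq
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


class NaiveBayes:
    """Two-class multinomial naive Bayes with Laplace smoothing (illustrative)."""

    def __init__(self, relevant_docs, irrelevant_docs):
        self.counts = {
            True: Counter(t for d in relevant_docs for t in tokenize(d)),
            False: Counter(t for d in irrelevant_docs for t in tokenize(d)),
        }
        self.vocab = set(self.counts[True]) | set(self.counts[False])
        total_docs = len(relevant_docs) + len(irrelevant_docs)
        self.log_prior = {
            True: math.log(len(relevant_docs) / total_docs),
            False: math.log(len(irrelevant_docs) / total_docs),
        }

    def _log_score(self, tokens, label):
        counts = self.counts[label]
        denom = sum(counts.values()) + len(self.vocab)
        # Tokens outside the training vocabulary are ignored.
        return self.log_prior[label] + sum(
            math.log((counts[t] + 1) / denom) for t in tokens if t in self.vocab
        )

    def prob_relevant(self, text):
        """Posterior probability that the text is on-topic."""
        tokens = tokenize(text)
        pos = self._log_score(tokens, True)
        neg = self._log_score(tokens, False)
        m = max(pos, neg)  # subtract the max for numerical stability
        return math.exp(pos - m) / (math.exp(pos - m) + math.exp(neg - m))


def link_score(page_prob, anchor_text, url, topic_terms, alpha=0.5):
    """Blend page relevance with link evidence (assumed linear combination)."""
    tokens = set(tokenize(anchor_text + " " + url))
    link_evidence = len(tokens & topic_terms) / len(topic_terms)
    return alpha * page_prob + (1 - alpha) * link_evidence


class Frontier:
    """Crawl frontier that always returns the highest-scoring unseen URL."""

    def __init__(self):
        self._heap = []
        self._seen = set()

    def push(self, url, score):
        if url not in self._seen:
            self._seen.add(url)
            heapq.heappush(self._heap, (-score, url))  # negate for max-heap

    def pop(self):
        neg_score, url = heapq.heappop(self._heap)
        return url, -neg_score


# Hypothetical training snippets and topic terms for a "search form" topic.
clf = NaiveBayes(
    relevant_docs=["search books by title author isbn",
                   "find flights search form database query"],
    irrelevant_docs=["celebrity gossip news today",
                     "sports scores and weather news"],
)
topic_terms = {"search", "query", "database", "form"}
page_prob = clf.prob_relevant("advanced search form for the book database")

frontier = Frontier()
for anchor, url in [("Celebrity gossip", "http://example.com/gossip"),
                    ("Advanced search", "http://example.com/search?form=advanced")]:
    frontier.push(url, link_score(page_prob, anchor, url, topic_terms))

best_url, best_score = frontier.pop()
print(best_url)
```

In this sketch both links come from the same page, so the page-relevance term is identical and the link-related term decides the crawl order. A real crawler would also fetch pages, extract links, and detect query interfaces, none of which is modeled here.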