Abstract:
Many pages on the Internet are generated dynamically by back-end databases and cannot be reached by traditional search engines; this hidden content is called the Deep Web. This paper proposes a focused crawling algorithm for discovering Deep Web sources. When evaluating the importance of a hyperlink, the algorithm considers both the relevance of the page to the topic and link-related information. Experiments indicate that the method is effective.
Key words:
Deep Web sources,
focused crawler,
Bayes classifier
Abstract: Many pages on the Internet are generated dynamically by back-end databases; these pages cannot be accessed through traditional search engines and are known as the Deep Web. Data source discovery is a key step in large-scale Deep Web data source integration. This paper proposes a focused crawling algorithm for Deep Web data sources. When evaluating the importance of a link, it considers both the relevance of the page to the topic and link-related information. Experiments show that the method is effective.
Keywords:
Deep Web data sources,
focused crawler,
Bayes classifier
CLC Number:
LIN Chao; ZHAO Peng-peng; CUI Zhi-ming. Deep Web Sources Focused Crawler[J]. Computer Engineering, 2008, 34(7): 56-58.
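The abstract names a Bayes classifier and a link-importance measure that combines page-topic relevance with link-related information, but gives no algorithmic detail. The following is a minimal Python sketch of that general idea only, assuming a two-class ("on-topic" / "off-topic") multinomial naive Bayes text classifier and a link priority that mixes the crawled page's relevance with the link's anchor text via a mixing weight `alpha`. All names, the training data, and the value of `alpha` are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial naive Bayes over bag-of-words tokens, with Laplace smoothing."""

    def __init__(self):
        self.class_docs = Counter()              # documents seen per class
        self.word_counts = defaultdict(Counter)  # token counts per class
        self.vocab = set()

    def train(self, labelled_docs):
        for tokens, label in labelled_docs:
            self.class_docs[label] += 1
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)

    def log_joint(self, tokens, label):
        # log P(label) + sum over tokens of log P(token | label)
        logp = math.log(self.class_docs[label] / sum(self.class_docs.values()))
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for t in tokens:
            logp += math.log((self.word_counts[label][t] + 1) / denom)
        return logp

    def p_on_topic(self, tokens):
        # Posterior P(on-topic | tokens), normalised over the two classes.
        on = math.exp(self.log_joint(tokens, "on"))
        off = math.exp(self.log_joint(tokens, "off"))
        return on / (on + off)


def link_priority(nb, page_tokens, anchor_tokens, alpha=0.6):
    """Mix the page's topical relevance with the link's own text.

    `alpha` is a hypothetical mixing weight, not a value from the paper.
    """
    return alpha * nb.p_on_topic(page_tokens) + (1 - alpha) * nb.p_on_topic(anchor_tokens)


# Toy demo: pages with query forms count as "on"-topic for source discovery.
nb = NaiveBayes()
nb.train([
    (["flight", "booking", "search", "form"], "on"),
    (["hotel", "query", "database", "form"], "on"),
    (["news", "sports", "weather"], "off"),
    (["blog", "photos", "music"], "off"),
])
score = link_priority(nb, ["hotel", "search", "form"], ["query", "booking"])
```

In a focused crawler, `score` would decide a frontier URL's position in the priority queue, so links on relevant pages with relevant anchor text are fetched first.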