
Computer Engineering ›› 2008, Vol. 34 ›› Issue (13): 78-80,1. doi: 10.3969/j.issn.1000-3428.2008.13.029

• Networks and Communications •

Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine

LEI Kai, WANG Dong-hai

  1. (Center for Internet Research and Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055)
  • Online: 2008-07-05 Published: 2008-07-05

Abstract: This paper introduces an implementation of an incremental Web crawler that supports daily updates of a search engine covering millions of Web pages. After analyzing the weaknesses of traditional periodic, centralized crawlers and the difficulties of incremental crawling, it proposes an update-prediction strategy and gives an MD5-based algorithm for detecting changed Web pages, a URL scheduling algorithm, and a URL caching algorithm. It then describes the distributed architecture of the system's modules and evaluates the algorithms on a purpose-built test dataset. The incremental crawler has run on the TianWang search engine at Peking University for more than six months: the update cycle is shortened by 20 days, the hit rate of change prediction reaches 79.4%, and timeliness, scalability, and stability are improved.

Key words: incremental crawler, Web evolution prediction, search engine
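
The abstract names MD5 as the basis for deciding whether a Web page has changed but gives no details on this page. The following is only a minimal sketch of that general idea, not the authors' implementation: it assumes the digest is computed over the raw page bytes and that the previous digest per URL is kept in memory. The names `page_digest` and `ChangeDetector` are hypothetical and do not come from the paper.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return the MD5 hex digest of a page body (assumed: raw bytes, no normalization)."""
    return hashlib.md5(content).hexdigest()


class ChangeDetector:
    """Hypothetical helper: remembers the last digest seen for each URL."""

    def __init__(self) -> None:
        self._digests: dict[str, str] = {}

    def observe(self, url: str, content: bytes) -> bool:
        """Record a fetch and report whether the page is new or has changed.

        A URL that has never been seen counts as changed, because no
        earlier digest exists to compare against.
        """
        digest = page_digest(content)
        changed = self._digests.get(url) != digest
        self._digests[url] = digest
        return changed


if __name__ == "__main__":
    detector = ChangeDetector()
    url = "http://example.com/"  # placeholder URL for illustration
    print(detector.observe(url, b"<html>v1</html>"))
    print(detector.observe(url, b"<html>v1</html>"))
    print(detector.observe(url, b"<html>v2</html>"))
```

In a sketch like this, any byte-level difference (for example an embedded timestamp) counts as a change, so a production crawler would likely normalize or extract the main content before hashing; whether and how the paper does so is not stated here.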
