Abstract:
This paper introduces an implementation of an incremental Web crawler that supports daily updates of a search engine covering millions of Web pages. After analyzing the weaknesses of traditional periodic crawlers and the difficulties of incremental crawling, it presents key strategies for predicting Web evolution, an MD5-based algorithm for locating changed Web pages, and algorithms for URL scheduling and caching; it then describes the implementation and evaluates the crawler system. The incremental crawler has been integrated with the TianWang search engine at Peking University for 6 months. The update cycle has been reduced by 20 days, the accuracy of evolution prediction reaches 79.4%, and timeliness, extensibility and stability have improved.
Key words:
incremental Crawler,
Web evolution prediction,
search engine
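The abstract mentions locating changed Web pages with MD5 but gives no details. As a minimal sketch of the general idea only, assuming a crawler that keeps one digest per URL from the previous crawl, change detection could look like the following. The function name `classify_page` and the in-memory dictionary `fingerprints` are hypothetical and are not taken from the TianWang implementation.

```python
import hashlib


def classify_page(url: str, content: bytes, fingerprints: dict) -> str:
    """Compare the MD5 digest of freshly fetched content with the stored one.

    Returns "new" if the URL has not been seen before, "changed" if the
    digest differs from the stored digest, and "unchanged" otherwise.
    The stored digest is updated in every case.
    (Hypothetical helper for illustration; not the paper's algorithm.)
    """
    digest = hashlib.md5(content).hexdigest()
    previous = fingerprints.get(url)
    fingerprints[url] = digest
    if previous is None:
        return "new"
    return "changed" if previous != digest else "unchanged"


if __name__ == "__main__":
    store = {}  # assumed in-memory store: URL -> MD5 hex digest
    print(classify_page("http://example.com/", b"<html>v1</html>", store))
    print(classify_page("http://example.com/", b"<html>v1</html>", store))
    print(classify_page("http://example.com/", b"<html>v2</html>", store))
```

Comparing fixed-length digests rather than full page bodies keeps the per-URL state small, which matters at the scale of millions of pages. A production crawler would likely normalize the page first (for example, by stripping timestamps or advertisements) so that trivial differences are not reported as changes; that step is omitted here.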
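Abstract (translated from Chinese): To address the weaknesses of traditional periodic, centralized crawling (Crawler) and the difficulties of incremental crawling, this paper proposes an update prediction strategy; presents an MD5 algorithm for determining whether a Web page has been updated, a URL scheduling algorithm and a URL caching algorithm; describes the implementation of the distributed architecture of the system's modules; and builds a test data set to evaluate the algorithms. The system has run on the Peking University TianWang search engine for more than half a year. The update cycle has been shortened by 20 days, the hit rate of change prediction reaches 79.4%, and timeliness, scalability and stability have improved.
Key words (translated from Chinese):
incremental crawling,
Web page change prediction,
search engine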
LEI Kai; WANG Dong-hai. Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine[J]. Computer Engineering, 2008, 34(13): 78-80,1.
雷 凯;王东海. 搜索引擎增量式搜集的实现与评测[J]. 计算机工程, 2008, 34(13): 78-80,1.