Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine

doi:10.3969/j.issn.1000-3428.2008.13.029

Computer Engineering, 2008, Vol. 34, Issue (13): 78-80,1.

• Networks and Communications •

LEI Kai, WANG Dong-hai

(Center for Internet Research and Engineering, Shenzhen Graduate School, Peking University, Shenzhen 518055)

Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-07-05 Published: 2008-07-05

Abstract

This paper introduces an implementation of an incremental Web crawler that supports daily updates of a search engine covering millions of Web pages. After analyzing the weaknesses of traditional periodic, centralized crawlers and the difficulties of incremental crawling, it presents a strategy for predicting Web page changes, an MD5-based algorithm for detecting changed pages, and algorithms for URL scheduling and URL caching. It then describes the distributed implementation of the system's modules and evaluates the algorithms on a purpose-built test set. The incremental crawler has run as part of the TianWang search engine at Peking University for more than six months: the update cycle has been shortened by 20 days, the change-prediction hit rate reaches 79.4%, and timeliness, scalability, and stability have improved.

Key words: incremental crawler, Web evolution prediction, search engine
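The abstract names MD5 as the basis for detecting changed pages but does not give the algorithm itself. As a rough illustration only, the general idea of digest-based change detection might be sketched as follows. The function names `page_digest` and `detect_change`, the in-memory dictionary store, and the example URL are assumptions made for this sketch, not details taken from the paper.

```python
import hashlib


def page_digest(content: bytes) -> str:
    """Return the MD5 hex digest of a fetched page body."""
    return hashlib.md5(content).hexdigest()


def detect_change(store: dict, url: str, content: bytes) -> bool:
    """Report whether a page is new or changed since the last crawl.

    `store` maps URL -> last seen digest (a hypothetical in-memory
    stand-in for whatever persistent store a real crawler would use).
    The stored digest is updated on every call.
    """
    new_digest = page_digest(content)
    changed = store.get(url) != new_digest
    store[url] = new_digest
    return changed


if __name__ == "__main__":
    digests = {}
    url = "http://example.com/"  # placeholder URL for illustration
    print(detect_change(digests, url, b"<html>v1</html>"))  # first visit
    print(detect_change(digests, url, b"<html>v1</html>"))  # same body
    print(detect_change(digests, url, b"<html>v2</html>"))  # body edited
```

Comparing fixed-size digests rather than full page bodies keeps the per-URL state small, which matters at the scale of millions of pages; a production crawler would likely also normalize volatile page content (timestamps, counters) before hashing, which this sketch does not attempt.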

CLC Number: TP393

LEI Kai; WANG Dong-hai. Implementation and Evaluation of Incremental Crawler Based on TianWang Search Engine[J]. Computer Engineering, 2008, 34(13): 78-80,1.

URL: https://www.ecice06.com/EN/Y2008/V34/I13/78

