
Computer Engineering, 2006, Vol. 32, Issue (20): 97-99. doi: 10.3969/j.issn.1000-3428.2006.20.036

• Software Technology and Database •

一种增量式并行Web信息采集方法

杨天奇,周 晔   

  1. (暨南大学计算机科学系,广州 510632)
  • 出版日期:2006-10-20 发布日期:2006-10-20

A Parallel System of Incremental Web Information Gathering

YANG Tianqi, ZHOU Ye   

  1. (Department of Computer Science, Jinan University, Guangzhou 510632)
  • Online: 2006-10-20 Published: 2006-10-20

摘要: 提出了一个基于多线程并行的增量式Web信息采集结构模型,并加以实现,该模型以线程并行的方式对Web页面同时采集,实现了全面、高效并且灵活的信息搜集,在系统实现过程中,采取Java语言中最新的特性、独特的URL调度策略保证了各个线程时间的下载并行与互不相交,页面分析过程为各个线程源源不断地提供下载源,而指纹判别算法保证了并行采集过程中的同步,有效地去除了冗余。对该系统作了测试,实验证明,该系统能有效地提高信息采集性能。

关键词: Web, 信息采集, 搜索引擎, 并行

Abstract: This paper studies how to crawl selected portions of the Web efficiently, a problem known as parallel Web crawling, and proposes and implements an architecture for a multithreaded, incremental parallel Web crawler. Pages are downloaded concurrently by multiple threads using recent features of the Java language, and a URL dispatching strategy keeps the threads' downloads parallel and non-overlapping. Page analysis extracts URLs that continuously supply the threads with new download targets, while a fingerprint algorithm keeps the parallel crawl synchronized and removes redundant pages. Tests of the system show that it effectively improves information-gathering performance.
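The abstract does not give the details of the URL dispatching strategy or the fingerprint algorithm, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes an in-memory map standing in for the Web, MD5 as the content fingerprint, and hypothetical names (`CrawlSketch`, `schedule`, `process`). A concurrent set lets exactly one thread claim each URL, which keeps downloads non-overlapping, and a second concurrent set of fingerprints discards pages whose content has already been stored.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Phaser;

final class CrawlSketch {
    private final Map<String, String> web;            // assumed stand-in for the Web: URL -> page text
    private final Set<String> claimedUrls = ConcurrentHashMap.newKeySet();
    private final Set<String> seenFingerprints = ConcurrentHashMap.newKeySet();
    private final Set<String> storedUrls = ConcurrentHashMap.newKeySet();
    private final Phaser pending = new Phaser(1);     // party 1 is the calling thread
    private final ExecutorService pool;

    CrawlSketch(Map<String, String> web, int threadCount) {
        this.web = web;
        this.pool = Executors.newFixedThreadPool(threadCount);
    }

    // Content fingerprint; MD5 is an assumption, the paper's algorithm is not specified.
    static String fingerprint(String content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest(content.getBytes(StandardCharsets.UTF_8))) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Dispatch a URL to the pool only if no thread has claimed it before.
    private void schedule(String url) {
        if (!claimedUrls.add(url)) {
            return;                                   // add() is atomic: one winner per URL
        }
        pending.register();                           // count the task before it starts
        pool.execute(() -> {
            try {
                process(url);
            } finally {
                pending.arriveAndDeregister();        // task finished
            }
        });
    }

    // "Download" a page, drop duplicates by fingerprint, and feed extracted links back.
    private void process(String url) {
        String content = web.get(url);                // simulated download
        if (content == null) {
            return;                                   // dead link
        }
        if (!seenFingerprints.add(fingerprint(content))) {
            return;                                   // same content already stored
        }
        storedUrls.add(url);
        for (String token : content.split("\\s+")) {  // naive page analysis
            if (token.startsWith("http://")) {
                schedule(token);
            }
        }
    }

    // Crawl from the seed and block until every scheduled task has finished.
    Set<String> run(String seed) {
        schedule(seed);
        pending.arriveAndAwaitAdvance();
        pool.shutdown();
        return storedUrls;
    }

    public static void main(String[] args) {
        Map<String, String> web = Map.of(
            "http://a.example/", "home http://a.example/x http://a.example/y",
            "http://a.example/x", "page x http://a.example/ http://a.example/z",
            "http://a.example/y", "same text",
            "http://a.example/z", "same text");       // mirror of /y
        Set<String> stored = new CrawlSketch(web, 4).run("http://a.example/");
        System.out.println("stored " + stored.size() + " pages: " + stored);
    }
}
```

In this toy graph the pages `/y` and `/z` carry identical text, so only one of them is stored; which one depends on thread timing. An incremental crawler could also keep the fingerprints between runs and compare them on a revisit to detect changed pages, though that step is not shown here.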

Key words: Web, Information gathering, Search engine, Parallel