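Abstract (translated from the Chinese): This paper proposes and implements a structural model for incremental Web information gathering based on multithreaded parallelism. The model crawls Web pages concurrently across threads, making information collection comprehensive, efficient, and flexible. In the implementation, recent features of the Java language and a distinctive URL scheduling strategy ensure that the threads download in parallel without overlapping one another. Page analysis continuously supplies each thread with new URLs to download, while a fingerprint-based detection algorithm keeps the parallel crawl synchronized and effectively removes redundancy. The system was tested, and the experiments show that it effectively improves information gathering performance.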
Key words: Web; information gathering; search engine; parallel
Abstract: This paper studies how to crawl selected portions of the Web efficiently, a problem known as parallel Web crawling, and proposes a structural design model for a parallel incremental Web crawler. To download Web pages in parallel, the system uses multiple threads and recent features of the Java language, together with a URL dispatching strategy that keeps the threads working in parallel without overlapping one another. Page analysis extracts URLs for the threads to download, and a fingerprint algorithm is used to reduce redundancy. Test results confirm expectations: the system effectively improves information gathering performance.
Key words: Web; information gathering; search engine; parallel
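The abstract names two mechanisms without detailing them: a URL dispatching strategy that keeps download threads from overlapping, and a fingerprint check that removes redundant work. The Java sketch below is only an illustration of how such mechanisms could fit together, not the authors' implementation. The class name `UrlDispatcher`, the host-hash partitioning rule, and the use of an MD5 digest of the URL string as the fingerprint are all assumptions introduced here.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/** Hypothetical sketch: one URL queue per download thread plus a shared fingerprint set. */
class UrlDispatcher {
    // One queue per worker thread; a worker only ever reads its own queue.
    private final List<BlockingQueue<String>> queues = new ArrayList<>();
    // Thread-safe set of fingerprints of URLs that were already dispatched.
    private final Set<String> seen = ConcurrentHashMap.newKeySet();

    UrlDispatcher(int workers) {
        for (int i = 0; i < workers; i++) {
            queues.add(new LinkedBlockingQueue<>());
        }
    }

    /** Assumed fingerprint: hex-encoded MD5 digest of the URL string. */
    static String fingerprint(String url) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(url.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Assumed partitioning rule: hash the host so each host maps to exactly one worker. */
    int ownerOf(String url) {
        // URI.create throws IllegalArgumentException for malformed URLs.
        String host = URI.create(url).getHost();
        String key = (host == null) ? url : host;
        return Math.floorMod(key.hashCode(), queues.size());
    }

    /** Enqueues the URL for its owning worker; returns false if it was already dispatched. */
    boolean offer(String url) {
        if (!seen.add(fingerprint(url))) {
            return false; // duplicate fingerprint: skip redundant download
        }
        queues.get(ownerOf(url)).add(url);
        return true;
    }

    /** The queue a given worker thread should poll for its next download. */
    BlockingQueue<String> queueFor(int worker) {
        return queues.get(worker);
    }
}
```

In this sketch, page analysis would call `offer` for every extracted link, and worker thread `i` would take URLs from `queueFor(i)`. Because `Set.add` on the concurrent set is atomic, two threads that discover the same link at the same time cannot both enqueue it, which is one way the fingerprint check could double as the synchronization point the abstract mentions.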
杨天奇;周 晔. 一种增量式并行Web信息采集方法[J]. 计算机工程, 2006, 32(20): 97-99.
YANG Tianqi; ZHOU Ye. A Parallel System of Incremental Web Information Gathering[J]. Computer Engineering, 2006, 32(20): 97-99.