Abstract:
This paper introduces a topic-specific intelligent Web crawler system and its crawling algorithm, which is based on Web content and structure mining. The algorithm exploits two properties of neural networks: they can conveniently simulate network topology, and they support parallel computation. Reinforcement learning is used to judge the relevance of a crawled page to the topic. The relevance calculation does not consider the whole content of the Web page. Instead, it extracts the important tags from the page's HTML markup and analyzes the content and structure of the page on that basis, in order to improve the efficiency and accuracy of information collection.
Key words:
Topic-specific crawler; Web mining; Neural network; Reinforcement learning
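The tag-based relevance idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the tag set, the weights in TAG_WEIGHTS, and the relevance function are all hypothetical choices made for illustration, and the neural network and reinforcement learning components are left out. It only shows how a score might be computed from text inside selected HTML tags while the rest of the page body is ignored.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights for "important" tags; the paper does not give these values.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.0, "b": 1.0, "strong": 1.0}


class TagTextExtractor(HTMLParser):
    """Collects (term, weight) pairs for text found inside weighted tags only."""

    def __init__(self):
        super().__init__()
        self._open_tags = []      # weighted tags currently open
        self.weighted_terms = []  # list of (term, weight)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching tag, tolerating sloppy markup.
        for i in range(len(self._open_tags) - 1, -1, -1):
            if self._open_tags[i] == tag:
                del self._open_tags[i]
                break

    def handle_data(self, data):
        if not self._open_tags:
            return  # text outside the weighted tags is ignored
        weight = max(TAG_WEIGHTS[t] for t in self._open_tags)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.weighted_terms.append((term, weight))


def relevance(html, topic_terms):
    """Weighted share of tagged terms that belong to the topic, in [0, 1]."""
    extractor = TagTextExtractor()
    extractor.feed(html)
    total = sum(w for _, w in extractor.weighted_terms)
    if total == 0:
        return 0.0
    hits = sum(w for term, w in extractor.weighted_terms if term in topic_terms)
    return hits / total
```

A crawler built along these lines might compare relevance(page_html, topic_terms) against a threshold to decide whether to keep a page and follow its links. How the threshold and weights are chosen or learned is outside the scope of this sketch.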
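Abstract: This paper presents a topic-specific intelligent Web crawler system based on Web content and structure mining, with a focus on its CA(C&S) algorithm. The algorithm makes full use of the fact that neural networks can conveniently simulate network topology and perform parallel computation, and it uses reinforcement learning to judge the relevance of a page to the topic. The relevance calculation does not consider the entire content of the page. Instead, it extracts the important tags from the page's HTML description and analyzes the content and structure of the Web page, thereby judging the relevance of the crawled page to the topic, in order to improve the efficiency and accuracy of information collection.
Key words:
Topic-specific crawling; Web mining; Neural network; Reinforcement learning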
QIAN Rong, XU Xinhua, ZHENG Ying, YANG Bingru. A Topic-specific Intelligent Web Crawler System[J]. Computer Engineering, 2006, 32(3): 57-59.
钱榕,徐新华,郑莹,杨炳儒. 智能专题化信息搜集 Crawler[J]. 计算机工程, 2006, 32(3): 57-59.