
Computer Engineering ›› 2006, Vol. 32 ›› Issue (3): 57-59.

• Software Technology and Database •

A Topic-specific Intelligent Web Crawler System

QIAN Rong1, XU Xinhua2, ZHENG Ying3, YANG Bingru1   

  1. School of Information Engineering, Beijing University of Science and Technology, Beijing 100083; 2. Guanzhuang Campus, Beijing University of Science and Technology, Beijing 100083; 3. Personnel Department, Jinan University, Jinan 250022
  • Online: 2006-02-05 Published: 2006-02-05

An Intelligent Topic-specific Information-gathering Crawler

钱榕 1,徐新华2,郑莹 3,杨炳儒1   

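  1. School of Information Engineering, Beijing University of Science and Technology, Beijing 100083; 2. Department of Information Engineering, Guanzhuang Campus, Beijing University of Science and Technology, Beijing 100083; 3. Personnel Department, Jinan University, Jinan 250022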

Abstract: This paper introduces a topic-specific intelligent Web Crawler system and its crawling algorithm, which is based on Web content and structure mining. The algorithm takes advantage of two properties of neural networks: they can conveniently simulate network topology, and they support parallel computation. Reinforcement learning is used to judge the relevance of each crawled page to the topic. When computing this relevance, the algorithm does not consider the whole content of the Web page. Instead, it extracts the important tags from the page's HTML markup and analyzes the content and structure of the page. The relevance of the crawled page to the topic is judged from this analysis, with the aim of improving the efficiency and accuracy of information collection.

Key words: Topic-specific crawler; Web mining; Neural network; Reinforcement learning

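Abstract: This paper introduces a topic-specific intelligent Web Crawler system based on Web content and structure mining, focusing on its CA(C&S) algorithm. The algorithm exploits the ability of neural networks to conveniently simulate network topology and to compute in parallel, and it uses reinforcement learning to judge the relevance of a Web page to the topic. When computing relevance, it does not consider the full content of the page. Instead, it extracts the important tags from the page's HTML description and analyzes the page's content and structure, thereby judging the relevance of the crawled page to the topic and improving the efficiency and precision of information gathering.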

Key words: Topic-specific crawling; Web mining; Neural network; Reinforcement learning
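
The abstract describes judging relevance from a page's important HTML tags rather than from its full text. The following is a minimal sketch of that one idea only, not the authors' algorithm: it uses a plain weighted keyword match and omits the neural network and reinforcement learning components entirely. The choice of tags, the weights in `TAG_WEIGHTS` and `META_WEIGHT`, and the function name `relevance` are all assumptions made for illustration; the abstract does not say which tags are treated as important or how they are weighted.

```python
import re
from html.parser import HTMLParser

# Hypothetical weights: the source does not list its "important" tags.
TAG_WEIGHTS = {"title": 3.0, "h1": 2.0, "h2": 1.5, "a": 1.0}
META_WEIGHT = 2.0  # for <meta name="keywords"> and <meta name="description">


class ImportantTagExtractor(HTMLParser):
    """Collect (weight, text) pairs from selected tags; ignore all other text."""

    def __init__(self):
        super().__init__()
        self.stack = []      # currently open weighted tags
        self.fragments = []  # list of (weight, text)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)
        elif tag == "meta":
            attr = dict(attrs)
            if (attr.get("name") or "").lower() in ("keywords", "description"):
                self.fragments.append((META_WEIGHT, attr.get("content") or ""))

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if self.stack and data.strip():
            self.fragments.append((TAG_WEIGHTS[self.stack[-1]], data))


def relevance(html, topic_terms):
    """Weighted share of words in important tags that are topic terms (0.0 to 1.0)."""
    parser = ImportantTagExtractor()
    parser.feed(html)
    parser.close()
    topic = {t.lower() for t in topic_terms}
    matched = total = 0.0
    for weight, text in parser.fragments:
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            total += weight
            if word in topic:
                matched += weight
    return matched / total if total else 0.0


if __name__ == "__main__":
    page = (
        "<html><head><title>Web crawler design</title></head>"
        "<body><p>crawler crawler crawler</p><h1>Cooking tips</h1></body></html>"
    )
    # Body text inside <p> is ignored; only the title and heading contribute.
    print(relevance(page, ["crawler", "web"]))
```

In a focused crawler, a score of this kind could be compared against a threshold to decide whether the links on a page are worth following. The threshold and the crawl policy are outside what the abstract specifies.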