
Computer Engineering (计算机工程), 2007, Vol. 33, Issue (06): 80-82. doi: 10.3969/j.issn.1000-3428.2007.06.028

• Software Technology and Database •

Crawling Dynamic Web Pages in WWW Forums

LI Kui (李魁) 1,2, CHENG Xueqi (程学旗) 1, GUO Yan (郭岩) 1, ZHANG Kai (张凯) 1

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100080;
  2. Graduate School, Chinese Academy of Sciences, Beijing 100039
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-03-20 Published:2007-03-20

Abstract: Web forums have become one of the dominant ways of publishing and exchanging information on the Internet. Searching and mining forum content both depend on acquiring that content first, yet traditional crawlers, which are designed for static pages and use a breadth-first strategy, cannot fetch forum content effectively. Exploiting the structural characteristics of forums, this paper proposes a "board-topic correlation judgment" (BTCJ) algorithm and adopts a crawling strategy based on board expansion. Experiments show that the method significantly outperforms the breadth-first strategy in both precision and coverage (recall) of forum crawling. It also generalizes well: in practice it has been applied to more than 12 000 forums of various types.
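
The abstract does not spell out how BTCJ decides that a link is a board or a topic, so the following is only a minimal sketch of the general idea of a board-expansion crawl set against a breadth-first baseline. Everything in it is an assumption made for illustration: the in-memory `SITE` graph, the URL patterns `/board/` and `/topic/`, and the helper names `kind`, `breadth_first` and `board_expansion`. In particular, `kind` is a simple URL-prefix rule standing in for the paper's correlation judgment, not a reconstruction of it.

```python
from collections import deque

# Hypothetical in-memory forum: URL -> outgoing links. Not taken from the paper.
SITE = {
    "/index": ["/board/1", "/board/2", "/user/7", "/login"],
    "/board/1": ["/board/1?page=2", "/topic/10", "/topic/11",
                 "/user/7", "/post?reply=10"],
    "/board/1?page=2": ["/board/1", "/topic/12", "/login"],
    "/board/2": ["/topic/20", "/search?q=x"],
    "/topic/10": ["/user/7", "/board/1", "/post?reply=10"],
    "/topic/11": ["/board/1"],
    "/topic/12": ["/board/1"],
    "/topic/20": ["/board/2"],
    "/user/7": ["/topic/10", "/login"],
    "/login": [],
    "/post?reply=10": ["/login"],
    "/search?q=x": ["/topic/10"],
}


def kind(url):
    """Assumed URL-prefix classifier; a stand-in for the BTCJ judgment."""
    if url.startswith("/board/"):
        return "board"
    if url.startswith("/topic/"):
        return "topic"
    return "other"


def breadth_first(seed):
    """Baseline: follow every link in FIFO order and return all visited URLs."""
    seen, queue = {seed}, deque([seed])
    while queue:
        url = queue.popleft()
        for link in SITE.get(url, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return seen


def board_expansion(seed):
    """Expand only board pages, collect the topics they link to, skip the rest."""
    boards = {link for link in SITE.get(seed, []) if kind(link) == "board"}
    queue = deque(sorted(boards))
    topics = set()
    while queue:
        board = queue.popleft()
        for link in SITE.get(board, []):
            link_kind = kind(link)
            if link_kind == "topic":
                topics.add(link)
            elif link_kind == "board" and link not in boards:
                boards.add(link)  # e.g. a further page of the same board
                queue.append(link)
    return boards, topics


if __name__ == "__main__":
    boards, topics = board_expansion("/index")
    print(len(breadth_first("/index")), "pages visited breadth-first")
    print(len(boards), "board pages expanded,", len(topics), "topics found")
```

On this toy graph the breadth-first baseline also visits login, profile, search and reply-form pages, while the board-expansion loop touches only board pages and the topics listed on them. A real crawler would need HTTP fetching, link extraction and a far more robust classifier than a URL prefix.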

Key words: WWW forums, information crawling, dynamic Web pages

CLC Number: