
Computer Engineering, 2011, Vol. 37, Issue 3: 64-66, 69. doi: 10.3969/j.issn.1000-3428.2011.03.023

• Software Technology and Database •

Webpage Content Extraction Based on DBSCAN

OUYANG Jia, LIN Pi-yuan

  1. (College of Informatics, South China Agricultural University, Guangzhou 510642, China)
  • Publication date: 2011-02-05 Release date: 2011-01-28
  • About the authors: OUYANG Jia (b. 1986), male, master's student, main research interest: data mining; LIN Pi-yuan, professor
  • Funding:
    National Natural Science Foundation of China (60573043)

Abstract: To address the problem of webpage content extraction, this paper proposes a method based on a section factor that filters the webpage source file into plain-text segments. Each segment is treated as a point in two-dimensional space, and the DBSCAN clustering algorithm clusters these points to obtain the main content. The method has low complexity, does not depend on a site's layout style, and is highly adaptable. Experiments on major Chinese and international news websites show that it extracts the main content of both Chinese and English news pages effectively, with high average accuracy.

Key words: topic-focused crawler, content extraction, DBSCAN, density
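
The clustering step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not define the section factor or say which two coordinates each text segment is mapped to, so the sketch assumes the segments have already been filtered out of the page, and it uses a hypothetical feature pair: the segment's position in document order and its character count divided by 20. The function name `extract_main_text`, the parameters `eps` and `min_pts`, and the rule of keeping the cluster with the most text are likewise assumptions made for illustration.

```python
from math import hypot

NOISE = -1  # label for points that belong to no cluster


def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN: return one cluster label per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels


def extract_main_text(segments, eps=2.0, min_pts=2):
    """Cluster segments and keep the cluster holding the most text.

    Assumed features (not from the paper): x = segment index,
    y = character count / 20.
    """
    points = [(float(i), len(s) / 20.0) for i, s in enumerate(segments)]
    labels = dbscan(points, eps, min_pts)
    totals = {}
    for label, seg in zip(labels, segments):
        if label != NOISE:
            totals[label] = totals.get(label, 0) + len(seg)
    if not totals:
        return []
    best = max(totals, key=totals.get)
    return [s for label, s in zip(labels, segments) if label == best]


if __name__ == "__main__":
    page = ["Home", "News", "Sports",
            "First paragraph of the story. " * 6,
            "Second paragraph of the story. " * 6,
            "Third paragraph of the story. " * 6,
            "Copyright 2011", "Contact us"]
    for seg in extract_main_text(page):
        print(seg[:30])
```

With these assumed features, short navigation and footer strings form small clusters of their own, while the long adjacent paragraphs form a separate dense group. A real page would need features and thresholds tuned to the data, which the abstract does not specify.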

CLC Number: