Computer Engineering ›› 2011, Vol. 37 ›› Issue (3): 64-66,69. doi: 10.3969/j.issn.1000-3428.2011.03.023

• Networks and Communications •

Webpage Content Extraction Based on DBSCAN

OUYANG Jia, LIN Pi-yuan   

  1. (College of Informatics, South China Agricultural University, Guangzhou 510642, China)
  • Online: 2011-02-05 Published: 2011-01-28

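  • About the authors: OUYANG Jia (欧阳佳, born 1986), male, master's student, main research interest: data mining; LIN Pi-yuan (林丕源), professor
  • Supported by:
    National Natural Science Foundation of China (60573043)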

Abstract: To address the problem of webpage content extraction, this paper presents a method based on a section-factor that filters the webpage source file into plain-text paragraphs. Each paragraph is regarded as a point in two-dimensional space, and the DBSCAN clustering algorithm clusters these points to obtain the main content. The method has low complexity, does not depend on the site layout style, and adapts well to different sites. Experiments on major domestic and international news websites show that the method extracts the main content effectively from both Chinese and English news sites and achieves a high average accuracy.

Key words: topic-focused crawler, content extraction, DBSCAN, density
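
The clustering step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which two coordinates represent a paragraph, so the sketch assumes (paragraph position, scaled text length). That feature choice, the names `dbscan` and `extract_content`, the parameter values, and the rule of keeping the cluster with the most text are all illustrative assumptions, not the authors' method. DBSCAN itself is written out in plain Python so the example has no dependencies.

```python
from math import hypot

NOISE = -1


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (i included)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if hypot(xi - xj, yi - yj) <= eps]


def dbscan(points, eps, min_pts):
    """Label each 2D point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1                   # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is also a core point
    return labels


def extract_content(paragraphs, eps=2.5, min_pts=3, length_scale=50.0):
    """Hypothetical helper: keep the densest-text cluster of paragraphs.

    Assumption: each paragraph maps to (position, length / length_scale).
    The paper's actual coordinates are not given in the abstract.
    """
    points = [(float(i), len(p) / length_scale)
              for i, p in enumerate(paragraphs)]
    labels = dbscan(points, eps, min_pts)
    clusters = {}
    for idx, label in enumerate(labels):
        if label != NOISE:
            clusters.setdefault(label, []).append(idx)
    if not clusters:
        return []
    # Assumption: the main content is the cluster holding the most text.
    best = max(clusters.values(),
               key=lambda idxs: sum(len(paragraphs[i]) for i in idxs))
    return [paragraphs[i] for i in best]
```

Under these assumptions, short navigation and footer lines either form a small cluster with little text or are marked as noise, while consecutive long paragraphs form the cluster that is returned. The values of `eps`, `min_pts`, and `length_scale` would need tuning on real pages.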
