摘要: 针对网页正文提取问题,提出一种基于分段因子的方法对网页源文件进行过滤得到纯文本段,将每段看作二维空间中的一个点,利用DBSCAN聚类算法对这些点进行聚类得到正文内容。该方法复杂度低,并且不依赖于网站布局风格,适应性强。对各大国内外新闻类网站进行实验,结果表明,该方法对中英文新闻类网站的正文提取效果明显,具有较高的平均准确率。
关键词: 主题爬虫, 正文提取, DBSCAN算法, 密度
Abstract: To address the problem of webpage content extraction, this paper presents a method that uses a section factor to filter the webpage source and obtain plain-text paragraphs. Each paragraph is treated as a point in two-dimensional space, and the DBSCAN clustering algorithm clusters these points to recover the main content. The method has low complexity, does not depend on the site's layout style, and adapts well to different sites. Experiments on major Chinese and international news websites show that the method extracts the main content effectively from both Chinese and English news sites and achieves a high average accuracy.
Key words: topic-focused crawler, content extraction, DBSCAN, density
欧阳佳, 林丕源. 基于DBSCAN算法的网页正文提取[J]. 计算机工程, 2011, 37(3): 64-66,69.
OUYANG Jia, LIN Pi-Yuan. Webpage Content Extraction Based on DBSCAN[J]. Computer Engineering, 2011, 37(3): 64-66,69.
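
The pipeline summarized in the abstract (filter the page into plain-text segments, map each segment to a 2D point, cluster the points with DBSCAN) can be illustrated with a minimal Python sketch. The abstract does not define the section factor or the two coordinates, so everything specific here is an assumption made for illustration: `split_segments` simply splits on HTML tags as a stand-in for the section-factor filter, the point for each segment is (position in page, log of text length), the values `eps=2.0` and `min_pts=3` are arbitrary, and the cluster with the most total text is taken as the body. None of these choices should be read as the authors' actual design.

```python
import math
import re


def split_segments(html):
    """Hypothetical stand-in for the section-factor filter:
    drop <script>/<style> blocks, then split the rest on tags."""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", html)
    parts = re.split(r"(?s)<[^>]+>", html)
    return [p.strip() for p in parts if p.strip()]


def to_points(segments):
    """Assumed features: x = segment position, y = log(1 + text length)."""
    return [(float(i), math.log1p(len(s))) for i, s in enumerate(segments)]


def dbscan(points, eps, min_pts):
    """Plain O(n^2) DBSCAN. Returns one label per point:
    a cluster id >= 0, or -1 for noise."""
    unseen, noise = None, -1
    labels = [unseen] * len(points)

    def neighbours(i):
        # Includes the point itself, as in the usual DBSCAN definition.
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not unseen:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = noise  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # border point joins the cluster
            if labels[j] is not unseen:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def extract_body(html, eps=2.0, min_pts=3):
    """Return the segments of the cluster holding the most text
    (an assumed selection rule, not taken from the paper)."""
    segments = split_segments(html)
    labels = dbscan(to_points(segments), eps, min_pts)
    totals = {}
    for seg, lab in zip(segments, labels):
        if lab != -1:
            totals[lab] = totals.get(lab, 0) + len(seg)
    if not totals:
        return []
    best = max(totals, key=totals.get)
    return [s for s, lab in zip(segments, labels) if lab == best]
```

Under these assumptions, short navigation or footer links form their own low-text clusters (or noise), while consecutive long paragraphs fall into one dense cluster that `extract_body` returns. A real page would likely need tuned `eps` and `min_pts` values and a more careful segmentation step.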