Webpage Content Extraction Based on DBSCAN

Computer Engineering, 2011, Vol. 37, Issue 3: 64-66, 69. doi: 10.3969/j.issn.1000-3428.2011.03.023

• Networks and Communications •

OUYANG Jia, LIN Pi-yuan

(College of Informatics, South China Agricultural University, Guangzhou 510642, China)

Online: 2011-02-05 Published: 2011-01-28

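About the authors: OUYANG Jia (born 1986), male, master's student, main research interest: data mining; LIN Pi-yuan, professor
Funding: National Natural Science Foundation of China (Grant No. 60573043)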

Abstract

To address the problem of webpage content extraction, this paper proposes a method that uses a section-factor to filter the webpage source and obtain plain-text paragraphs. Each paragraph is treated as a point in two-dimensional space, and the DBSCAN clustering algorithm groups these points to identify the main content. The method has low complexity, does not depend on a site's layout style, and adapts well to different sites. Experiments on major Chinese and international news websites show that the method extracts the main content effectively from both Chinese and English news sites, with a high average accuracy.

Key words: topic-focused crawler, content extraction, DBSCAN, density
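
The pipeline outlined in the abstract (filter the page into plain-text paragraphs, map each paragraph to a 2D point, cluster with DBSCAN) can be illustrated with a minimal Python sketch. The abstract specifies neither the section-factor nor the two coordinates, so everything concrete here is an assumption made for illustration: paragraphs are split on block-level tags as a stand-in for the section-factor, each point is (paragraph index, log2 of text length), and the cluster with the most total text is taken as the content. The names `split_paragraphs`, `dbscan`, and `extract_content` and the default `eps` and `min_pts` values are hypothetical, not taken from the paper.

```python
import math
import re
from html import unescape


def split_paragraphs(html):
    """Drop scripts/styles, split on block-level tags, strip remaining markup.
    (A stand-in for the paper's section-factor filtering, which is not given here.)"""
    html = re.sub(r"(?is)<(script|style)\b.*?</\1>", " ", html)
    chunks = re.split(r"(?i)</?(?:p|div|br|li|td|h[1-6])\b[^>]*>", html)
    texts = [unescape(re.sub(r"<[^>]+>", " ", c)).strip() for c in chunks]
    return [re.sub(r"\s+", " ", t) for t in texts if t]


def dbscan(points, eps, min_pts):
    """Plain DBSCAN on 2D points; returns one label per point, -1 meaning noise."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps ** 2]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1          # provisionally noise; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:   # j is a core point: keep expanding
                queue.extend(reachable)
    return labels


def extract_content(html, eps=2.0, min_pts=3):
    """Return the paragraphs of the cluster holding the most text (assumed heuristic)."""
    paragraphs = split_paragraphs(html)
    # Assumed 2D features: position in the page and log2 of the text length.
    points = [(float(i), math.log2(len(p))) for i, p in enumerate(paragraphs)]
    labels = dbscan(points, eps, min_pts)
    totals = {}
    for label, p in zip(labels, paragraphs):
        if label != -1:
            totals[label] = totals.get(label, 0) + len(p)
    if not totals:
        return []
    best = max(totals, key=totals.get)
    return [p for label, p in zip(labels, paragraphs) if label == best]
```

Under these assumptions, consecutive long paragraphs sit close together in both coordinates and form a dense cluster, while short navigation or footer fragments either fall into a separate small cluster or are labeled as noise. Other feature choices (for example link density or punctuation counts) would fit the same skeleton.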

CLC Number: TP18

OUYANG Jia, LIN Pi-yuan. Webpage Content Extraction Based on DBSCAN[J]. Computer Engineering, 2011, 37(3): 64-66,69.

URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2011.03.023

http://www.ecice06.com/EN/Y2011/V37/I3/64
