Abstract: To segment and extract the content of social media Web pages, this paper presents a method that requires neither hand-crafted rules nor training examples. It uses the k-means algorithm to identify the frequent blocks in a page and group them into a set of frequent clusters, finds the topic frequent clusters in that set, and automatically induces extraction rules by self-learning over the structure of the frequent blocks they contain. Experimental results show that the method is efficient and robust for social media Web pages of various styles and layouts, achieving high precision and recall.
Key words: social media, DOM structure, k-means algorithm, self-learning, extraction rule, Web page content extraction
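The clustering step named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the two-number block features (DOM depth, child count), the farthest-first seeding, and the use of the largest cluster as a stand-in for a "topic frequent cluster" are all assumptions made for illustration, since the abstract does not specify them.

```python
from collections import Counter

def sq_dist(a, b):
    """Squared Euclidean distance between two feature tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def init_centroids(points, k):
    """Farthest-first seeding (an assumption): start from the first point,
    then repeatedly add the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, max_iters=50):
    """Plain Lloyd-style k-means; returns one cluster label per point."""
    centroids = init_centroids(points, k)
    labels = []
    for _ in range(max_iters):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster's centroid
        if updated == centroids:
            break
        centroids = updated
    return labels

def largest_cluster(labels):
    """Indices of the points in the most populated cluster."""
    label, _ = Counter(labels).most_common(1)[0]
    return [i for i, lab in enumerate(labels) if lab == label]

# Hypothetical blocks: (name, (DOM depth, number of child nodes)).
blocks = [
    ("nav", (2, 8)),
    ("post", (5, 3)),
    ("post", (5, 3)),
    ("post", (5, 4)),
    ("post", (5, 3)),
    ("footer", (2, 1)),
]
labels = kmeans([features for _, features in blocks], k=3)
frequent = [blocks[i][0] for i in largest_cluster(labels)]
print(frequent)  # ['post', 'post', 'post', 'post']
```

In this toy page the four repeated post blocks share one cluster, while the navigation bar and footer each fall into their own. A real system would need richer structural features and a principled way to choose k and to pick the topic cluster.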
解姝, 叶施仁, 肖春. 社会媒体网页内容的分割与抽取[J]. 计算机工程, 2011, 37(21): 155-158.
XIE Shu, YE Shi-Ren, XIAO Chun. Segmentation and Extraction for Social Media Web Page Content[J]. Computer Engineering, 2011, 37(21): 155-158.