
Computer Engineering ›› 2011, Vol. 37 ›› Issue (21): 155-158. doi: 10.3969/j.issn.1000-3428.2011.21.053

• Artificial Intelligence and Recognition Technology •

Segmentation and Extraction for Social Media Web Page Content

XIE Shu 1, YE Shi-ren 2, XIAO Chun 1

  (1. Key Laboratory of Intelligent Computing & Information Processing of MOE, Xiangtan University, Xiangtan 411105, China; 2. College of Information, Changzhou University, Changzhou 213164, China)
  • Received: 2011-04-21 Online: 2011-11-05 Published: 2011-11-05
  • About the authors: XIE Shu (born 1986), female, master's student, main research interest: information extraction; YE Shi-ren, Ph.D.; XIAO Chun, associate professor, Ph.D.

Abstract: This paper presents a method for segmenting and extracting the content of content-rich social media Web pages that requires neither hand-crafted rules nor training examples. The method uses the k-means algorithm to identify the frequent blocks in a page and groups them into a collection of frequent clusters. It then identifies the topic frequent clusters in that collection and, through a self-supervised approach, learns the structure of their frequent blocks to generate extraction rules automatically. Experimental results show that the method is efficient and robust across social media Web pages of various styles and layouts, achieving high precision and recall.

Key words: social media, DOM structure, k-means algorithm, self-learning, extraction rule, Web page content extraction
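
As an illustration only, the pipeline summarized in the abstract (cluster page blocks with k-means, keep the frequent clusters, pick a topic cluster, derive an extraction rule) might be sketched as follows. The abstract does not specify the block features, the choice of k, the topic-cluster criterion, or the rule format, so all of these are assumptions here: blocks are `<div>` elements described by (depth, child count, text length), the topic cluster is the frequent cluster holding the most text, and the rule is simply the most common DOM path in that cluster. The names `BlockCollector`, `kmeans`, `extract_topic_blocks`, and the sample `PAGE` are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class BlockCollector(HTMLParser):
    """Collect one candidate block per <div>: its DOM path, features, and text."""

    def __init__(self):
        super().__init__()
        self.stack = []   # open elements as [tag, label, child_count, text_parts]
        self.blocks = []  # dicts with keys "path", "features", "text"

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.stack:
            self.stack[-1][2] += 1
        css_class = dict(attrs).get("class")
        label = f"{tag}.{css_class}" if css_class else tag
        self.stack.append([tag, label, 0, []])

    def handle_data(self, data):
        text = data.strip()
        if text:
            for frame in self.stack:  # text counts toward every open ancestor
                frame[3].append(text)

    def handle_endtag(self, tag):
        if not self.stack or self.stack[-1][0] != tag:
            return  # this sketch ignores unbalanced markup
        path = "/".join(frame[1] for frame in self.stack)
        _, _, child_count, parts = self.stack.pop()
        if tag == "div":
            text = " ".join(parts)
            # Assumed features: depth, number of child elements, text length.
            features = (float(len(self.stack)), float(child_count), float(len(text)))
            self.blocks.append({"path": path, "features": features, "text": text})


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with a deterministic, evenly spaced initialisation."""
    centers = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sq_dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels


def extract_topic_blocks(html, k=2, min_support=3):
    """Return (rule, texts): an assumed path-based rule and the blocks it matches."""
    collector = BlockCollector()
    collector.feed(html)
    blocks = collector.blocks
    if len(blocks) < k:
        return None, []
    labels = kmeans([b["features"] for b in blocks], k)
    # Frequent clusters: clusters containing at least min_support blocks.
    frequent = [c for c, n in Counter(labels).items() if n >= min_support]
    if not frequent:
        return None, []

    # Topic cluster (assumed heuristic): the frequent cluster holding the most text.
    def text_mass(c):
        return sum(b["features"][2] for b, lab in zip(blocks, labels) if lab == c)

    topic = max(frequent, key=text_mass)
    # Extraction rule (assumed form): the most common DOM path in the topic cluster.
    paths = Counter(b["path"] for b, lab in zip(blocks, labels) if lab == topic)
    rule = paths.most_common(1)[0][0]
    return rule, [b["text"] for b in blocks if b["path"] == rule]


PAGE = """
<html><body>
<div id="nav"><a>Home</a><a>Login</a></div>
<div class="post"><span>alice</span><p>First post about k-means clustering of page blocks.</p></div>
<div class="post"><span>bob</span><p>A reply that discusses DOM structure and extraction.</p></div>
<div class="post"><span>carol</span><p>Another reply with a comparable amount of text here.</p></div>
<div id="footer"><a>About</a></div>
</body></html>
"""

rule, posts = extract_topic_blocks(PAGE)
print(rule)
for post in posts:
    print("-", post)
```

On this toy page the three post blocks fall into one cluster and the navigation and footer blocks into another, so only the post cluster meets the support threshold. No labeled examples are used at any step, which is the property the abstract emphasizes; a realistic implementation would need richer structural features and a more careful rule format than a single DOM path.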

CLC number: