
Computer Engineering

• Artificial Intelligence and Recognition Technology •

Web Content Automatic Extraction Based on Data Enrichment Region

XU Zhi-jian, SUN Lei

  1. (Department of Computer Science & Technology, East China Normal University, Shanghai 200241, China)
  • Received: 2012-08-17 Online: 2013-09-15 Published: 2013-09-13
  • About the authors: XU Zhi-jian (born 1987), male, master's student, main research interests: decision support systems and cognitive science; SUN Lei, associate professor
  • Supported by:
    Shanghai Natural Science Foundation project (09ZR1409500)

Abstract: Automatically extracting commodity information from the Web pages of e-commerce sites provides valuable input for value-added services such as price comparison and querying. To this end, a method for automatic Web content extraction is proposed. The target page is first denoised by comparing its tag tree with that of a sample page. The data enrichment region of the target page is then mined with a sub-tree similarity computation based on tree matching, and the commodity data records are extracted from that region. Experimental results on five e-commerce Web sites show that the precision of this method is higher than that of the Mining Data Records (MDR) method on every site, and that its recall is also high.

Key words: data enrichment region, Web content extraction, tree matching, tag tree, sub-tree similarity, data record
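
The abstract does not give the similarity formula, so the following is only a minimal sketch of how a tree-matching-based sub-tree similarity could be computed, not the authors' implementation. It assumes the classic Simple Tree Matching recurrence, normalised by the average sub-tree size. The `Node` type, the function names, and the 0.8 threshold are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest top-down, order-preserving matching of a and b."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size divided by the average sub-tree size, in [0, 1]."""
    return simple_tree_matching(a, b) / ((size(a) + size(b)) / 2)


def similar_sibling_runs(parent: Node, threshold: float = 0.8) -> List[List[Node]]:
    """Group adjacent children whose similarity to the previous one >= threshold.

    The threshold value is an assumption, not taken from the paper.
    """
    runs: List[List[Node]] = []
    current: List[Node] = []
    for child in parent.children:
        if current and similarity(current[-1], child) >= threshold:
            current.append(child)
        else:
            if len(current) > 1:
                runs.append(current)
            current = [child]
    if len(current) > 1:
        runs.append(current)
    return runs
```

Under these assumptions, a parent node whose children form a long run of mutually similar sub-trees (for example, repeated product entries in a listing) would be a candidate data enrichment region, and each sub-tree in the run a candidate data record.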

CLC Number: