Computer Engineering

Web Content Automatic Extraction Based on Data Enrichment Region

XU Zhi-jian, SUN Lei   

  1. (Department of Computer Science & Technology, East China Normal University, Shanghai 200241, China)
  • Received: 2012-08-17 Online: 2013-09-15 Published: 2013-09-13

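  • About the authors: XU Zhi-jian (许志坚, born 1987), male, master's student, research interests: decision support systems and cognitive science; SUN Lei (孙 蕾), associate professor
  • Supported by:
    Natural Science Foundation of Shanghai (No. 09ZR1409500)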

Abstract: Automatically extracting commodity information from the Web pages of e-commerce sites can provide valuable data for value-added services such as price comparison and price querying. An automatic Web content extraction method is proposed for such pages. It consists of three steps: denoising the target page by comparing its tag tree with that of a sample page; mining the data enrichment region of the target page by computing the similarity between sub-trees based on tree matching; and extracting the data records from that region. Experimental results on five e-commerce Web sites show that the precision of this method is higher than that of the Mining Data Records (MDR) method on all five sites, and that its recall is also high.
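The sub-tree similarity step can be illustrated with a short sketch. The abstract does not say which tree-matching algorithm is used, so the code below assumes the classic Simple Tree Matching recurrence on ordered tag trees and normalizes the match count by the sizes of the two sub-trees. The `Node` type, the function names, the 0.8 threshold, and the rule "pick the node with the most similar adjacent children" are all hypothetical choices made for illustration, not the authors' implementation. The denoising step is not covered.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """A tag-tree node: a tag name plus ordered children (hypothetical type)."""
    tag: str
    children: list[Node] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum top-down matching between two ordered trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # best[i][j]: best matching between the first i children of a
    # and the first j children of b (order-preserving).
    best = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            best[i][j] = max(
                best[i - 1][j],
                best[i][j - 1],
                best[i - 1][j - 1]
                + simple_tree_matching(a.children[i - 1], b.children[j - 1]),
            )
    return best[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical tag structure."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def data_rich_region(root: Node, threshold: float = 0.8) -> Node | None:
    """Return the node with the most adjacent child pairs that are similar.

    Assumption: a data enrichment region is a node whose neighbouring
    children repeat the same structure (for example, product entries).
    """
    best_node, best_count = None, 0
    stack = [root]
    while stack:
        node = stack.pop()
        kids = node.children
        count = sum(similarity(x, y) >= threshold for x, y in zip(kids, kids[1:]))
        if count > best_count:
            best_node, best_count = node, count
        stack.extend(kids)
    return best_node
```

Under these assumptions, the children of the returned node would be the candidate data records. A real page would first need to be parsed into `Node` objects by an HTML parser of your choice.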

Key words: data enrichment region, Web content extraction, tree matching, tag tree, sub-tree similarity, data record
