Web Content Automatic Extraction Based on Data Enrichment Region

doi:10.3969/j.issn.1000-3428.2013.09.043

Computer Engineering

XU Zhi-jian, SUN Lei

(Department of Computer Science & Technology, East China Normal University, Shanghai 200241, China)

Received:2012-08-17 Online:2013-09-15 Published:2013-09-13
About the authors: XU Zhi-jian (born 1987), male, master's student, main research interests: decision support systems and cognitive science; SUN Lei, associate professor
Funding:
Supported by the Natural Science Foundation of Shanghai (09ZR1409500)

Abstract

Abstract: Automatically extracting commodity information from the Web pages of e-commerce sites provides valuable input for value-added services such as price comparison and querying. To this end, a method for automatic Web content extraction is proposed. The target page is first denoised by comparing its tag tree with that of a sample page. The data enrichment region of the target page is then mined with a sub-tree similarity measure based on tree matching, and the commodity data records are extracted from that region. Experimental results on five e-commerce Web sites show that the precision of this method is higher than that of the Mining Data Records (MDR) method on every site, and that its recall is high.

Key words: data enrichment region, Web content extraction, tree-matching, tag tree, sub-tree similarity, data record
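
The abstract does not state the similarity formula or how the region is selected, so the following is only a minimal sketch of the idea of mining a data enrichment region by sub-tree similarity. It assumes the classic Simple Tree Matching dynamic program, a normalization by the average tree size, and a threshold of 0.8. The names `Node`, `simple_tree_matching`, `similarity`, `longest_similar_run` and `find_data_region` are hypothetical and not taken from the paper, and the denoising step is omitted.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A tag-tree node: an HTML tag name plus its ordered children."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(c) for c in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the maximum order-preserving matching between two trees."""
    if a.tag != b.tag:
        return 0
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            pair = simple_tree_matching(a.children[i - 1], b.children[j - 1])
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1] + pair)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Matching size normalized by the average tree size (assumed formula)."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


def longest_similar_run(children: List[Node], threshold: float) -> List[Node]:
    """Longest run of adjacent siblings whose neighbours are similar."""
    best: List[Node] = []
    run = children[:1]
    for prev, cur in zip(children, children[1:]):
        run = run + [cur] if similarity(prev, cur) >= threshold else [cur]
        if len(run) > len(best):
            best = run
    return best if len(best) >= 2 else []


def find_data_region(root: Node, threshold: float = 0.8) -> Tuple[Optional[Node], List[Node]]:
    """Return the node whose children contain the longest similar run."""
    best_parent, best_run = None, []
    stack = [root]
    while stack:
        node = stack.pop()
        run = longest_similar_run(node.children, threshold)
        if len(run) > len(best_run):
            best_parent, best_run = node, run
        stack.extend(node.children)
    return best_parent, best_run


def item() -> Node:
    """A toy commodity record: <li><a><img></a><span></span></li>."""
    return Node("li", [Node("a", [Node("img")]), Node("span")])


page = Node("body", [
    Node("div", [Node("a"), Node("a")]),    # navigation links
    Node("ul", [item(), item(), item()]),   # repeated commodity records
    Node("div", [Node("p")]),               # footer
])
region, records = find_data_region(page)
print(region.tag, len(records))
```

On this toy page the three structurally identical `li` sub-trees form the longest similar run, so the `ul` element is reported as the candidate region. Real pages would need a parsed DOM and a threshold tuned on sample data.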

CLC number: TP391

XU Zhi-jian, SUN Lei. Web Content Automatic Extraction Based on Data Enrichment Region[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2013.09.043.

http://www.ecice06.com/CN/Y2013/V39/I9/192

Editor: LIU Bing


