Web Content Automatic Extraction Based on Data Enrichment Region

doi:10.3969/j.issn.1000-3428.2013.09.043

Computer Engineering

XU Zhi-jian, SUN Lei

(Department of Computer Science & Technology, East China Normal University, Shanghai 200241, China)

Received: 2012-08-17 Online: 2013-09-15 Published: 2013-09-13

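About the authors: XU Zhi-jian (born 1987), male, master's student, main research interests: decision support systems and cognitive science; SUN Lei, associate professor.
Funding: Shanghai Natural Science Foundation project (09ZR1409500)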

Abstract

Automatically extracting commodity information from the Web pages of e-commerce sites provides valuable data for value-added services such as price comparison and querying. To this end, a method for automatic Web content extraction is proposed. The method denoises the target page by comparing its tag tree with that of a sample page, mines the data enrichment region of the target page by computing the similarity between sub-trees using tree matching, and then extracts the data records from that region. Experimental results on five e-commerce Web sites show that the precision of this method is higher than that of the Mining Data Records (MDR) method on every site, and that its recall is also relatively high.

Key words: data enrichment region, Web content extraction, tree matching, tag tree, sub-tree similarity, data record
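
The abstract names tree matching as the basis for sub-tree similarity but does not give the formula. The following is a minimal sketch of one common choice, the Simple Tree Matching algorithm, with a normalized score; it is not taken from the paper. The `Node` class, the function names, and the normalization `2 * matches / (size(a) + size(b))` are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node: an HTML tag name plus ordered children (hypothetical type)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def size(node: Node) -> int:
    """Number of nodes in the sub-tree rooted at `node`."""
    return 1 + sum(size(child) for child in node.children)


def simple_tree_matching(a: Node, b: Node) -> int:
    """Size of the largest order-preserving top-down matching between two trees."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be matched
    m, n = len(a.children), len(b.children)
    # dp[i][j] = best matching between the first i children of a
    # and the first j children of b.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            paired = dp[i - 1][j - 1] + simple_tree_matching(
                a.children[i - 1], b.children[j - 1]
            )
            dp[i][j] = max(dp[i - 1][j], dp[i][j - 1], paired)
    return dp[m][n] + 1  # +1 for the matched roots


def similarity(a: Node, b: Node) -> float:
    """Assumed normalization: 1.0 for identical trees, 0.0 for mismatched roots."""
    return 2 * simple_tree_matching(a, b) / (size(a) + size(b))


# Two product-like records that differ by one extra <span>.
record_a = Node("li", [Node("img"), Node("a"), Node("span")])
record_b = Node("li", [Node("img"), Node("a"), Node("span"), Node("span")])
print(similarity(record_a, record_b))
```

Under this sketch, sibling sub-trees whose pairwise similarity exceeds a chosen threshold would be candidates for the data enrichment region; the threshold itself is not stated in the abstract.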

CLC Number: TP391

XU Zhi-jian, SUN Lei. Web Content Automatic Extraction Based on Data Enrichment Region[J]. Computer Engineering, doi: 10.3969/j.issn.1000-3428.2013.09.043.

URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2013.09.043

http://www.ecice06.com/EN/Y2013/V39/I9/192

References

[1] Liu Wei, Meng Xiaofeng, Meng Weiyi. A Survey of Deep Web Data Integration[J]. Chinese Journal of Computers, 2007, 30(9): 1475-1489. (in Chinese)
[2] Alberto H, Berthier A, Altigran S, et al. A Brief Survey of Web Data Extraction Tools[J]. ACM SIGMOD Record, 2002, 31(2): 84-93.
[3] Crescenzi V, Mecca G, Merialdo P. Road-runner: Towards Automatic Data Extraction from Large Web Sites[C]//Proc. of the 26th International Conference on Very Large Database Systems. Roma, Italy: [s. n.], 2001.
[4] Chang Chaihui, Lu Shaochen. IEPAD: Information Extraction Based on Pattern Discovery[C]//Proc. of the 10th International Conference on World Wide Web. Hong Kong, China: [s. n.], 2001.
[5] Liu Bing, Grossman R, Zhai Yanhong. Mining Data Records in Web Pages[C]//Proc. of the 9th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Washington D. C., USA: [s. n.], 2003.
[6] Liu Bing, Grossman R, Zhai Yanhong. Mining Web Pages for Data Records[J]. IEEE Intelligent Systems, 2004, 19(6): 49-55.
[7] Wikipedia. Patricia Tree[EB/OL]. (2010-11-21). http://de.wikipedia.org/wiki/PAT_Tree.
[8] W3C. HTML 4.0.1 Specification[EB/OL]. (2010-11-21). http://www.w3.org/TR/html401/.
[9] Wang Jiying, Lochovsky F. Data-rich Section Extraction from HTML Pages[C]//Proc. of the 3rd Conference on Web Information Systems Engineering. Singapore: [s. n.], 2002: 313-322.
[10] Hu Renlong, Yuan Chunfeng, Wu Gangshan, et al. Automatic Web Information Extraction Based on Repeated Patterns[J]. Computer Engineering, 2008, 34(22): 73-76. (in Chinese)
[11] Wu Yang. Identifying Syntactic Differences Between Two Programs[J]. Software-practice and Experience, 1991, 21(7): 739-755.
[12] Yang Guizhen, Mukherjee S, Ramakrishnan I V. On Precision and Recall of Multi-attribute Data Extraction from Semi-structured Sources[C]//Proc. of the 3rd IEEE International Conference on Data Mining. Washington D. C., USA: [s. n.], 2003.

Editor: LIU Bing

Related Articles

[1] HUANG Wu-guan, ZHU Ming, YIN Wen-ke. Web Information Automatic Extraction Based on DOM Tree and Visual Feature[J]. Computer Engineering, 2013, 39(10): 309-312.
[2] ZHANG Zhi-wei, LIU Deng-di, CAI Jian-yu, YUAN Kun-gang, ZHU Jin-hui. HLA-based Data Record and Playback Model[J]. Computer Engineering, 2010, 36(5): 255-256.
[3] LV Yong-le, LANG Rong-ling. Noise Reduction Method for Flight Data Based on Singular Value Decomposition[J]. Computer Engineering, 2010, 36(3): 260-262.
[4] YANG Zhou, ZHUO Lin, DIAO Peng-Peng, CUI Zhi-Meng. Automatic Extraction Method for Product Data Records[J]. Computer Engineering, 2010, 36(23): 262-265.
[5] JIU Lin, QIU Hui-Zhong. Web Page Top-down Content Information Localization Algorithm[J]. Computer Engineering, 2010, 36(13): 76-78.
[6] YUAN Mingxuan, ZHANG Xuanping, JIANG Yu, ZHAO Zhongmeng. Noise Elimination Method in Web Pages Based on the Similarity of Same Layer Pages[J]. Computer Engineering, 2006, 32(23): 61-63.
