Automatic Web Information Extraction Based on Repetitive Pattern

doi:10.3969/j.issn.1000-3428.2008.22.025

Computer Engineering, 2008, Vol. 34, Issue (22): 73-76.

• Software Technology and Database •

HU Ren-long, YUAN Chun-feng, WU Gang-shan, PU Xiao-jia

(State Key Laboratory for Computer Novel Software Technology, Nanjing University, Nanjing 210093)

Online: 2008-11-20 Published: 2008-11-20

基于重复模式的自动Web信息抽取

胡仁龙，袁春风，武港山，濮小佳

(南京大学计算机软件新技术国家重点实验室，南京 210093)

Abstract

Many online shopping sites exist on the Web, and the commodity information in their pages can be extracted to support e-commerce and Web query services. This paper presents an automatic approach to Web information extraction for such sites. The approach locates the topic area of a single Web page by detecting repetitive patterns and analyzing the characteristics of the topic area. The extraction requires no human intervention. The approach was tested on 10 online shopping sites, and the experimental results show that it is effective.

Key words: Web information extraction, DOM tree, repetitive pattern
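
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of repetitive-pattern detection on a DOM tree, not the authors' method. It assumes well-formed HTML, compares subtrees by tag structure alone, and picks the node whose children share one structure most often as a candidate topic area. All names (`TreeBuilder`, `signature`, `find_repetitive_region`, `min_repeats`) and the sample page are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

# Tags with no closing tag; they never become the parent of later nodes.
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class Node:
    """A minimal DOM node: tag name plus child elements (text is ignored)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a simplified DOM tree; assumes well-formed nesting."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Real pages are often malformed and would need error recovery here.
        if self.current.tag == tag and self.current.parent is not None:
            self.current = self.current.parent


def signature(node):
    """Structural signature of a subtree: tag names only, no text or attributes."""
    return node.tag + "(" + ",".join(signature(c) for c in node.children) + ")"


def find_repetitive_region(root, min_repeats=3):
    """Return (node, signature, count) for the node whose children repeat most."""
    best = (None, None, 0)
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        # Childless elements carry no record structure, so they are skipped.
        counts = Counter(signature(c) for c in node.children if c.children)
        if counts:
            sig, count = counts.most_common(1)[0]
            if count >= min_repeats and count > best[2]:
                best = (node, sig, count)
    return best


# Hypothetical product listing page, used only for illustration.
SAMPLE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/cart">Cart</a></div>
  <ul id="items">
    <li><img src="a.jpg"><span>Book A</span><b>10.00</b></li>
    <li><img src="b.jpg"><span>Book B</span><b>12.50</b></li>
    <li><img src="c.jpg"><span>Book C</span><b>8.75</b></li>
  </ul>
</body></html>
"""

builder = TreeBuilder()
builder.feed(SAMPLE)
region, pattern, repeats = find_repetitive_region(builder.root)
print(region.tag, pattern, repeats)
```

Exact signature matching is the simplest possible similarity test. Real product listings often vary slightly between records, so a practical system would likely need an approximate comparison (for example, tree edit distance) and the additional topic-area features that the abstract mentions.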

CLC Number: TP311

HU Ren-long; YUAN Chun-feng; WU Gang-shan; PU Xiao-jia. Automatic Web Information Extraction Based on Repetitive Pattern[J]. Computer Engineering, 2008, 34(22): 73-76.

URL: http://www.ecice06.com/EN/10.3969/j.issn.1000-3428.2008.22.025

http://www.ecice06.com/EN/Y2008/V34/I22/73

