
Computer Engineering, 2008, Vol. 34, Issue (22): 73-76. doi: 10.3969/j.issn.1000-3428.2008.22.025

• Software Technology and Database •

Automatic Web Information Extraction Based on Repetitive Pattern

HU Ren-long, YUAN Chun-feng, WU Gang-shan, PU Xiao-jia   

  1. (State Key Laboratory for Computer Novel Software Technology, Nanjing University, Nanjing 210093)
  • Online: 2008-11-20 Published: 2008-11-20

基于重复模式的自动Web信息抽取

胡仁龙,袁春风,武港山,濮小佳   

  1. (南京大学计算机软件新技术国家重点实验室,南京 210093)

Abstract: Many online shopping sites exist on the Web, and the product information in their pages can be extracted to provide value-added services for e-commerce and Web query. This paper presents an automatic Web information extraction approach for such sites. The approach locates the topic area of a single Web page by detecting repetitive patterns and analyzing the characteristics of the topic area. No human intervention is required during extraction. Tests on 10 online shopping sites show that the approach is effective.

Key words: Web information extraction, DOM tree, repetitive pattern

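Abstract (translated from Chinese): Many online shopping sites exist on the Internet, and extracting the product information from their pages can provide value-added services for e-commerce and Web query. For such sites, this paper proposes an automatic Web information extraction method that obtains the topic content of a Web page by detecting repetitive patterns in the page and analyzing the features of the topic content; the method requires no human intervention during extraction. Tests on 10 online shopping sites show that the proposed method is effective.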

Key words (translated from Chinese): Web information extraction, DOM tree, repetitive pattern
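
The abstract describes finding the topic area of a page by detecting repetitive patterns in its DOM tree. The following is only a minimal sketch of that general idea, not the authors' algorithm: it builds a tag-only tree with Python's standard `html.parser`, serializes each subtree into a structural signature, and picks the parent whose children repeat one signature most heavily. The names `Node`, `TreeBuilder`, `signature`, `find_repetitive_region`, the `min_repeats` threshold, the scoring rule (repeat count times subtree size), and the sample page are all assumptions made for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Elements that never have children or a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """A tag-only DOM node (text and attributes are ignored)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a bare-bones element tree from an HTML string."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def signature(node):
    """Serialize the tag structure of a subtree, e.g. 'div(img,a,span)'."""
    if not node.children:
        return node.tag
    inner = ",".join(signature(child) for child in node.children)
    return f"{node.tag}({inner})"


def subtree_size(node):
    """Number of elements in the subtree rooted at node."""
    return 1 + sum(subtree_size(child) for child in node.children)


def find_repetitive_region(html, min_repeats=3):
    """Return (parent_tag, repeated_signature, repeats) or None.

    Assumed heuristic: score = repeats * size of the repeated subtree,
    so larger repeated records outrank short navigation lists.
    """
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()

    best = None
    stack = [builder.root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        sigs = [signature(child) for child in node.children]
        if not sigs:
            continue
        sig, repeats = Counter(sigs).most_common(1)[0]
        if repeats < min_repeats:
            continue
        sample = node.children[sigs.index(sig)]
        score = repeats * subtree_size(sample)
        if best is None or score > best[0]:
            best = (score, node.tag, sig, repeats)
    return best[1:] if best else None


# Hypothetical page: a two-item menu plus three product records.
SAMPLE_PAGE = """
<html><body>
  <ul><li><a href="/">Home</a></li><li><a href="/cart">Cart</a></li></ul>
  <div id="products">
    <div><img src="a.png"><a href="/p/1">Item A</a><span>$10</span></div>
    <div><img src="b.png"><a href="/p/2">Item B</a><span>$12</span></div>
    <div><img src="c.png"><a href="/p/3">Item C</a><span>$15</span></div>
  </div>
</body></html>
"""
```

Calling `find_repetitive_region(SAMPLE_PAGE)` selects the product container rather than the menu, because the menu repeats only twice. Exact signature matching is a deliberate simplification; real shopping pages often vary slightly between records, so a practical extractor would likely need approximate matching and the additional topic-area analysis the abstract mentions.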
