Computer Engineering ›› 2007, Vol. 33 ›› Issue (11): 53-55,5. doi: 10.3969/j.issn.1000-3428.2007.11.020

• Software Technology and Database •

Dynamical Data Regions Identification and Extraction in Web Pages

HUANG Jianbin1,2, JI Hongbing1, SUN Heli3   

  1. School of Electronic Engineering, Xidian University, Xi’an 710071; 2. School of Computer Science, Xidian University, Xi’an 710071; 3. Department of Computer Science & Technology, Xi’an Jiaotong University, Xi’an 710049
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-06-05 Published: 2007-06-05

Chinese title: Web网页中动态数据区域的识别与抽取 (Identification and Extraction of Dynamic Data Regions in Web Pages)

黄健斌1,2,姬红兵1,孙鹤立3   

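  1. School of Electronic Engineering, Xidian University, Xi’an 710071; 2. School of Computer Science, Xidian University, Xi’an 710071; 3. Department of Computer Science & Technology, Xi’an Jiaotong University, Xi’an 710049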

Abstract: This paper presents an improved method for finding data blocks in the HTML tag tree in order to mine the data regions embedded in a Web page. On that basis, Web page clustering is combined with cross-page data region matching to identify the dynamic data regions of a page automatically. Experimental results show that the approach improves both the recall and the precision of dynamic data region identification.

Key words: Web data regions extraction, Dynamical data regions identification, Cross-page analysis

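Abstract (translated from the Chinese): A data block search method based on the HTML tag tree is used to mine the data regions in a Web page. On this basis, Web page clustering is combined with cross-page data region matching to identify the dynamic data regions of a page automatically. Experimental results show that the method improves the recall and precision of dynamic data region identification in Web pages.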

Key words (translated from the Chinese): Web data region extraction, dynamic data region identification, cross-page analysis
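
The cross-page idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm, whose details are not given on this page. It assumes the input pages have already been clustered as sharing one template, treats each distinct tag path as a candidate region (a simplification), and flags a region as dynamic when its text differs between pages. The names `RegionTextCollector`, `extract_regions`, and `dynamic_regions` are hypothetical.

```python
from collections import defaultdict
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not be pushed on the path.
VOID_TAGS = {"br", "hr", "img", "input", "meta", "link"}


class RegionTextCollector(HTMLParser):
    """Collects visible text keyed by the tag path that leads to it."""

    def __init__(self):
        super().__init__()
        self.path = []                    # stack of currently open tags
        self.regions = defaultdict(list)  # tag path -> text fragments

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.path.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag; tolerate unclosed inner tags.
        if tag in self.path:
            while self.path.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.regions["/".join(self.path)].append(text)


def extract_regions(html):
    """Map each tag path in one page to the tuple of texts found there."""
    collector = RegionTextCollector()
    collector.feed(html)
    collector.close()
    return {path: tuple(texts) for path, texts in collector.regions.items()}


def dynamic_regions(pages):
    """Return the tag paths whose text differs across same-cluster pages."""
    per_page = [extract_regions(page) for page in pages]
    all_paths = set().union(*per_page)
    return sorted(
        path for path in all_paths
        if len({regions.get(path) for regions in per_page}) > 1
    )


# Two hypothetical pages generated from the same template.
page_a = "<html><body><div><h1>Shop</h1></div><ul><li>Camera</li><li>$120</li></ul></body></html>"
page_b = "<html><body><div><h1>Shop</h1></div><ul><li>Laptop</li><li>$900</li></ul></body></html>"
print(dynamic_regions([page_a, page_b]))  # ['html/body/ul/li']
```

The shared heading is treated as static template content, while the list items, which change from page to page, are reported as a dynamic region. A fuller treatment would need a real block-finding step on the tag tree and a real clustering step, both of which this sketch takes as given.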
