Abstract:
This paper presents an improved approach for finding data blocks in the HTML tag tree in order to mine the data regions embedded in a Web page. Web page clustering is combined with cross-page data region analysis to identify the dynamic data regions of a page. Experimental results show the effectiveness of the proposed approach.
Keywords:
Web data region extraction,
Dynamic data region identification,
Cross-page analysis
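Abstract (Chinese version): A data block search method based on the HTML tag tree is used to mine the data regions in a Web page. On this basis, Web page clustering is combined with cross-page data region matching to automatically identify the dynamic data regions in a page. Experimental results show that the method improves both the recall and the precision of dynamic data region identification in Web pages.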
HUANG Jianbin; JI Hongbing; SUN Heli. Dynamical Data Regions Identification and Extraction in Web Pages[J]. Computer Engineering, 2007, 33(11): 53-55,5.