
Computer Engineering ›› 2006, Vol. 32 ›› Issue (23): 61-63. doi: 10.3969/j.issn.1000-3428.2006.23.022

• Software Technology and Database •

A Method for Removing Web Page Noise Based on the Similarity of Same-Layer Pages

袁明轩,张选平,蒋 宇,赵仲孟   

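  1. (Institute of Software, School of Electronic and Information Engineering, Xi’an Jiaotong University, 710049)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2006-12-05 Published: 2006-12-05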

Noise Elimination Method in Web Pages Based on the Similarity of Same Layer Pages

YUAN Mingxuan, ZHANG Xuanping, JIANG Yu, ZHAO Zhongmeng   

  1. (Institute of Software, Dept. of Computer Science & Engineering, Xi’an Jiaotong University, Xi’an, 710049)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2006-12-05 Published: 2006-12-05

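Abstract: An ordinary Web page can be divided into two parts: information blocks and noise blocks. The first step of Web-based information retrieval is to filter out the noise blocks in a page. The characteristics of Web pages show that pages in the same layer mostly share similar display styles and noise blocks. Building on the VIPS algorithm, this paper proposes a matching algorithm based on the similarity of same-layer pages, which can be used to filter noise blocks out of Web pages. Experiments show that the algorithm achieves an accuracy above 95%.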

Key words: Web page noise, VIPS algorithm, similar-tree comparison

Abstract: A typical Web page can be separated into two kinds of segments: valuable segments and noise segments. The first step of information retrieval on the Web is to eliminate the noise segments, or blocks. This paper studies the properties of Web pages and finds that pages with a common URL prefix usually have similar presentation styles and noise segments. Based on vision-based page segmentation (VIPS), it proposes an approximate sub-tree matching algorithm that can be used to eliminate the noise segments in a Web page. In experiments, the implemented algorithm identifies noise blocks with an accuracy of over 95%.

Key words: Web page noise, VIPS algorithm, Approximate tree-matching
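
The idea summarized in the abstract can be illustrated with a small sketch. It is not the paper's algorithm: VIPS segmentation and approximate sub-tree matching are replaced here by a pre-segmented list of text blocks per page and a plain string-similarity ratio from Python's difflib. The names layer_key, similarity and noise_blocks, the two thresholds, and the example.com URLs are all assumptions made for illustration. A block is flagged as noise when a close match appears in most pages that share the same URL prefix.

```python
from __future__ import annotations

from difflib import SequenceMatcher
from urllib.parse import urlsplit


def layer_key(url: str) -> str:
    """Assumed notion of 'same layer': same host and same parent path."""
    parts = urlsplit(url)
    parent = parts.path.rsplit("/", 1)[0]
    return f"{parts.netloc}{parent}/"


def similarity(a: str, b: str) -> float:
    """Text similarity in [0, 1]; a stand-in for approximate sub-tree matching."""
    return SequenceMatcher(None, a, b).ratio()


def noise_blocks(
    page: list[str],
    siblings: list[list[str]],
    threshold: float = 0.9,    # assumed: how close two blocks must be to match
    min_support: float = 0.8,  # assumed: share of sibling pages that must match
) -> list[int]:
    """Return indices of blocks in `page` that recur in most sibling pages."""
    if not siblings:
        return []
    noisy = []
    for i, block in enumerate(page):
        # Count sibling pages containing at least one near-identical block.
        hits = sum(
            any(similarity(block, other) >= threshold for other in sibling)
            for sibling in siblings
        )
        if hits / len(siblings) >= min_support:
            noisy.append(i)
    return noisy


# Hypothetical pages, already split into text blocks (VIPS would do this step).
pages = {
    "http://example.com/news/a.html": [
        "Home | News | Sports | Contact",
        "Local team wins the championship after a dramatic final.",
        "Copyright 2006 Example Corp.",
    ],
    "http://example.com/news/b.html": [
        "Home | News | Sports | Contact",
        "New library opens downtown with extended hours.",
        "Copyright 2006 Example Corp.",
    ],
    "http://example.com/news/c.html": [
        "Home | News | Sports | Contact",
        "City council approves the annual budget.",
        "Copyright 2006 Example Corp.",
    ],
}

target = "http://example.com/news/a.html"
same_layer = [
    blocks
    for url, blocks in pages.items()
    if url != target and layer_key(url) == layer_key(target)
]
noisy = noise_blocks(pages[target], same_layer)
content = [b for i, b in enumerate(pages[target]) if i not in noisy]
print(noisy)    # [0, 2]: the navigation bar and the copyright line
print(content)  # only the article sentence remains
```

Comparing whole text blocks keeps the sketch short; a closer reproduction would compare the sub-trees of the VIPS block hierarchy, which this example does not attempt.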