Computer Engineering ›› 2007, Vol. 33 ›› Issue (08): 89-91. doi: 10.3969/j.issn.1000-3428.2007.08.030

• Software Technology and Database •

An Effective Approach to Eliminating Noises in HTML Pages

LUO Cheng, LI Bicheng, ZHANG Xianfei   

  1. (Information Engineering Institute, Information Engineering University, Zhengzhou 450002)
  • Received:1900-01-01 Revised:1900-01-01 Online:2007-04-20 Published:2007-04-20

Abstract: Most Web pages contain noisy blocks such as navigation panels, copyright notices, and advertisements, which degrade the accuracy of Web application systems. Eliminating noisy content quickly and accurately is therefore a key technique for improving the performance of such systems. This paper proposes a novel approach to reducing the noise content in Web pages. It uses a tree structure, called a pattern tree (PT), to capture the common layout of the pages in a given Web site, and introduces an entropy-based measure on the nodes of the PT to eliminate the noisy blocks of the site. The approach is applied to an SVM-based Web page classification system; the results show that purified pages bring some improvement in both the accuracy and the efficiency of classification.
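
The abstract does not give the entropy formula or the construction of the pattern tree, so the following is only a minimal sketch of the general idea, under stated assumptions. It assumes each page has already been flattened into a mapping from a layout path (a stand-in for a PT node) to the text of that block, and it scores each path with the normalized Shannon entropy of the texts seen there across the site. Blocks whose content barely varies between pages (navigation, copyright) score near 0 and are dropped. The names `score_blocks`, `purify`, the path strings, and the threshold of 0.5 are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def shannon_entropy(values):
    """Shannon entropy of the observed values, normalized to [0, 1].

    Normalization divides by log2(n), the maximum entropy reachable
    when all n observations are distinct (an assumption of this sketch).
    """
    counts = Counter(values)
    total = len(values)
    if len(counts) <= 1:
        return 0.0  # identical content everywhere: no information
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(total)


def score_blocks(pages):
    """Score every layout path by how much its content varies across pages.

    pages: list of dicts mapping a layout path to the block's text.
    """
    seen = defaultdict(list)
    for page in pages:
        for path, text in page.items():
            seen[path].append(text)
    scores = {}
    for path, texts in seen.items():
        # A path seen on a single page gives no evidence of repetition,
        # so this sketch keeps it by assigning the maximum score.
        scores[path] = shannon_entropy(texts) if len(texts) > 1 else 1.0
    return scores


def purify(page, scores, threshold=0.5):
    """Keep only the blocks whose score reaches the (hypothetical) threshold."""
    return {p: t for p, t in page.items() if scores.get(p, 1.0) >= threshold}


# Three toy pages from one site: shared navigation and copyright, distinct bodies.
pages = [
    {"body/div[1]": "Home | News | About",
     "body/div[2]": "Article about SVM",
     "body/div[3]": "Copyright 2007"},
    {"body/div[1]": "Home | News | About",
     "body/div[2]": "Article about entropy",
     "body/div[3]": "Copyright 2007"},
    {"body/div[1]": "Home | News | About",
     "body/div[2]": "Article about trees",
     "body/div[3]": "Copyright 2007"},
]
scores = score_blocks(pages)
clean = purify(pages[0], scores)
```

On this toy input only the article block survives purification, which is the kind of text one would then pass to a classifier such as the SVM system mentioned in the abstract. A real implementation would need to work on parsed DOM trees and handle near-duplicate content, which exact string comparison does not.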

Key words: Document tree, Pattern tree, Element node, Style node, Web page purification