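Abstract (translated from the Chinese): Most Web pages contain noise such as advertisements, copyright notices, and navigation links, which degrades the quality of Web application systems; removing noisy content from pages quickly and accurately is therefore one of the key techniques for improving the performance of Web applications. This paper proposes a Web page purification method that represents the layout structure of pages with a pattern tree (PT) and eliminates noise according to the information entropy of the nodes in the pattern tree. In experiments, the method was applied to an SVM classification system, and the results show that using purified pages yields some improvement in both the accuracy and the efficiency of classification.
Keywords: document tree; pattern tree; element node; style node; Web page purification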
Abstract: Most Web pages contain noisy blocks such as navigation panels, copyright notices, and advertisements, which degrade the accuracy of Web application systems. Eliminating noisy content accurately and efficiently is therefore a key technique for improving the service quality of such systems. This paper proposes an approach to reducing the noise content in Web pages. It uses a tree structure, called a pattern tree (PT), to capture the common layout of the pages in a given Web site, and it introduces an entropy-based measure on the nodes of the PT to identify and remove the site's noisy blocks. The approach is applied in an SVM-based Web page classification system. The improvement observed in this application supports the validity of the approach.
Key words: document tree; pattern tree; element node; style node; Web page purification
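The entropy-based idea in the abstract can be illustrated with a minimal sketch. It is not the paper's actual measure or pattern tree construction: it assumes each page has already been flattened into a mapping from a node path to that node's text, and it treats a node whose content barely varies across the pages of a site (low normalized Shannon entropy) as noise. The names `node_entropy`, `find_noise_nodes`, `purify`, the path strings, and the `threshold` value are all hypothetical choices made for this illustration.

```python
import math
from collections import Counter


def node_entropy(contents):
    """Normalized Shannon entropy (0..1) of the contents seen at one node position.

    Assumption: identical text across pages means the node carries no
    page-specific information (e.g. a navigation bar or copyright line).
    """
    n = len(contents)
    if n <= 1:
        return 0.0
    counts = Counter(contents)
    h = sum((c / n) * math.log2(n / c) for c in counts.values())
    return h / math.log2(n)


def find_noise_nodes(pages, threshold=0.5):
    """Return node paths whose content entropy across pages is below threshold.

    `pages` is a list of dicts mapping a node path to its text; this flat
    representation stands in for the paper's pattern tree (hypothetical).
    """
    by_path = {}
    for page in pages:
        for path, text in page.items():
            by_path.setdefault(path, []).append(text)
    return {
        path
        for path, texts in by_path.items()
        # A node seen on a single page gives no evidence either way.
        if len(texts) > 1 and node_entropy(texts) < threshold
    }


def purify(page, noise_paths):
    """Drop the nodes flagged as noise from one page."""
    return {path: text for path, text in page.items() if path not in noise_paths}


pages = [
    {"body/div[1]": "Home | About", "body/div[2]": "Article one", "body/div[3]": "Copyright 2007"},
    {"body/div[1]": "Home | About", "body/div[2]": "Article two", "body/div[3]": "Copyright 2007"},
    {"body/div[1]": "Home | About", "body/div[2]": "Article three", "body/div[3]": "Copyright 2007"},
]
noise = find_noise_nodes(pages)
cleaned = [purify(page, noise) for page in pages]
```

In this toy site the navigation and copyright nodes repeat verbatim on every page, so their entropy is zero and they are removed, while the article node differs on each page and is kept. The purified pages could then be passed to a classifier such as the SVM system mentioned in the abstract.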
罗 成;李弼程;张先飞. 一种有效的网页噪声消除的方法[J]. 计算机工程, 2007, 33(08): 89-91.
LUO Cheng; LI Bicheng; ZHANG Xianfei. An Effective Approach to Eliminating Noises in HTML Pages[J]. Computer Engineering, 2007, 33(08): 89-91.