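Abstract: This paper proposes a new method for removing noise from Web pages based on page framework and rules. The method uses HTML tags in the page to distinguish topic content from noise content, and then removes the noise content. Tests on 132 559 Web pages from the CWT200G corpus show that the method removes Web page noise effectively, reduces the index files by about 75%, greatly increases retrieval speed, and also improves accuracy to some extent.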
Keywords: information retrieval; Web page noise; page framework
Abstract: This paper presents an approach to eliminating noise from Web pages based on page framework and rules. The approach divides a page into several parts according to paragraph-related HTML tags and eliminates the noise content on that basis. Experiments on a set of 132 559 Web pages from CWT200G show that the approach eliminates the noise content of Web pages effectively and decreases the size of the index files by about 75%. Retrieval becomes much faster, and retrieval accuracy also improves to some extent.
Key words: information retrieval; noise content; Web page framework
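The block-then-filter idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not say which tags or rules are used, so the tag set `BLOCK_TAGS`, the thresholds `MAX_LINK_RATIO` and `MIN_TEXT_CHARS`, and the names `BlockSplitter` and `remove_noise` are all assumptions chosen for illustration. The sketch splits a page into text blocks at block-level tag boundaries and drops blocks that are very short or consist mostly of link text.

```python
from html.parser import HTMLParser

# Hypothetical choices: the abstract names neither the tags nor the rules.
BLOCK_TAGS = {"p", "td", "div"}
MAX_LINK_RATIO = 0.5   # drop blocks where over half the characters are link text
MIN_TEXT_CHARS = 20    # drop blocks with fewer non-space characters than this


def visible_chars(text):
    """Count the non-whitespace characters in text."""
    return sum(1 for ch in text if not ch.isspace())


class BlockSplitter(HTMLParser):
    """Collect text blocks, closing the current block at each BLOCK_TAGS boundary."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (block_text, link_char_count)
        self._text = []
        self._link_chars = 0
        self._in_link = 0     # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def _flush(self):
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self._flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += visible_chars(data)

    def close(self):
        super().close()
        self._flush()


def remove_noise(html):
    """Return the text of the blocks that pass both illustrative rules."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    kept = []
    for text, link_chars in parser.blocks:
        total = visible_chars(text)
        if total < MIN_TEXT_CHARS:
            continue                      # rule 1: too little text
        if link_chars / total > MAX_LINK_RATIO:
            continue                      # rule 2: mostly navigation links
        kept.append(text)
    return kept
```

Only the blocks returned by `remove_noise` would be passed to the indexer, which is how a filter of this kind could shrink the index; the actual reduction depends on the rules and the corpus.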
时达明;林鸿飞;杨志豪. 基于网页框架和规则的网页噪音去除方法[J]. 计算机工程, 2007, 33(19): 276-278.
SHI Da-ming; LIN Hong-fei; YANG Zhi-hao. Approach of Eliminating Noise Based on Framework of Web Pages and Rules[J]. Computer Engineering, 2007, 33(19): 276-278.