
Computer Engineering ›› 2007, Vol. 33 ›› Issue (19): 276-278. doi: 10.3969/j.issn.1000-3428.2007.19.098

• Development Research and Design Technology •

Approach of Eliminating Noise Based on Framework of Web Pages and Rules

SHI Da-ming, LIN Hong-fei, YANG Zhi-hao

  1. (Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116024)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2007-10-05 Published: 2007-10-05

Abstract: This paper presents a new approach to eliminating Web page noise based on the page framework and rules. The approach uses paragraph-related HTML tags to divide a page into several parts and to distinguish topical content from noise content, and then removes the noise content. Experiments on 132 559 Web pages from the CWT200G corpus show that the approach removes Web page noise effectively and reduces the size of the index files by about 75%, which greatly increases retrieval speed and also improves retrieval accuracy to some extent.

Key words: information retrieval, noise content, Web page framework
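
The general idea in the abstract (split a page into blocks at HTML tags, then apply rules to drop noisy blocks) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the specific tags and rules are not given here, so the tag set `BLOCK_TAGS`, the link-density rule, and the thresholds `max_link_ratio` and `min_chars` are all assumptions chosen only for illustration.

```python
from html.parser import HTMLParser

# Assumed block-level tags; the tags actually used by the paper are not given here.
BLOCK_TAGS = {"p", "td", "div"}


class BlockSplitter(HTMLParser):
    """Split a page into text blocks at block-level tags and count link text."""

    def __init__(self):
        super().__init__()
        self.blocks = []      # list of (text, link_chars) tuples
        self._text = []       # text fragments of the current block
        self._link_chars = 0  # characters of the current block inside <a>
        self._in_link = 0     # nesting depth inside <a>
        self._skip = 0        # nesting depth inside <script>/<style>

    def flush(self):
        """Close the current block and store it if it contains any text."""
        text = " ".join("".join(self._text).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._text = []
        self._link_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS:
            self.flush()

    def handle_data(self, data):
        if self._skip:
            return
        self._text.append(data)
        if self._in_link:
            self._link_chars += len(data.strip())


def remove_noise(html, max_link_ratio=0.5, min_chars=20):
    """Return the text blocks kept after applying two illustrative rules."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        if len(text) < min_chars:
            continue  # assumed rule: very short blocks are treated as noise
        if link_chars / len(text) > max_link_ratio:
            continue  # assumed rule: link-heavy blocks are navigation noise
        kept.append(text)
    return kept
```

Under these assumed rules, a navigation bar made mostly of links and a short copyright line would be dropped, while an ordinary paragraph of body text would be kept and passed on to the indexer.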

CLC Number: