Computer Engineering (计算机工程)

• Artificial Intelligence and Recognition Technology •

结合网页结构与文本特征的正文提取方法

熊忠阳,蔺显强,张玉芳,牙 漫   

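  1. (College of Computer Science, Chongqing University, Chongqing 400044, China)
  • Received: 2012-11-21 Published: 2013-12-15 Online: 2013-12-13
  • About the authors: XIONG Zhong-yang (born 1962), male, professor, main research interests: data mining, grid technology, and parallel computing; LIN Xian-qiang, master's student; ZHANG Yu-fang, professor; YA Man, master's student
  • Funding:
    Supported by the National Natural Science Foundation of China (71102065)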

Content Extraction Method Combining Web Page Structure and Text Feature

XIONG Zhong-yang, LIN Xian-qiang, ZHANG Yu-fang, YA Man   

  1. (College of Computer Science, Chongqing University, Chongqing 400044, China)
  • Received: 2012-11-21 Published: 2013-12-15 Online: 2013-12-13

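Abstract (translated from the Chinese): A Web page contains main-content information as well as information unrelated to the main content, and the unrelated information has a negative effect on the classification, storage, and retrieval of Web pages. To reduce this effect, a content extraction method that combines the structural features and the text features of Web pages is proposed. Regular expressions are used to remove irrelevant elements from the page, which completes a first-pass filtering. The page is then segmented linearly into blocks according to its structural features, each block is classified as a link block or a text block according to its text features, and the consecutive occurrence of noise blocks is used to locate the main-content region and obtain the page's main text. Experimental results show that the method extracts the main content of Web pages quickly and accurately.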

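Keywords (translated from the Chinese): content extraction, Web page denoising, Web page segmentation, subject crawling, information retrieval, Web mining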

Abstract: A Web page contains both relevant and irrelevant information, and the irrelevant information negatively affects the classification, storage, and retrieval of Web pages. To reduce this influence, this paper targets theme-related Web pages and proposes a new method that extracts page content based on text and structural features. It removes unrelated tags from the page with regular expressions and segments the page into blocks according to the page structure and the text information. By analyzing the text blocks and link blocks, it retains only the main content of the page and deletes the noisy parts. Experimental results show that the method is feasible and achieves high accuracy in page cleaning and content extraction.

Key words: content extraction, Web page denoising, Web page segmentation, subject crawling, information retrieval, Web mining
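
The pipeline outlined in the abstract (regular-expression filtering, linear block segmentation, link-block versus text-block classification, and locating the content between runs of noise blocks) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the tag lists, the thresholds `min_chars`, `max_link_ratio`, and `max_gap`, and all function names are assumptions chosen for illustration, and the localization step is a simple stand-in that keeps the longest run of text blocks interrupted by at most `max_gap` consecutive noise blocks.

```python
import html
import re

# Assumed patterns; the paper's actual expressions are not given in the abstract.
NOISE_RE = re.compile(
    r"<!--.*?-->|<(script|style|noscript|iframe)\b.*?</\1\s*>",
    re.S | re.I,
)
BLOCK_TAG_RE = re.compile(
    r"</?(?:div|p|td|tr|table|li|ul|ol|h[1-6]|br)\b[^>]*>", re.I
)
ANCHOR_RE = re.compile(r"<a\b[^>]*>(.*?)</a\s*>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def visible_text(fragment):
    """Strip remaining tags, decode entities, and collapse whitespace."""
    return " ".join(html.unescape(TAG_RE.sub("", fragment)).split())


def split_blocks(page):
    """First-pass regex filter, then a linear split at block-level tags.

    Returns a list of (text, link_chars) pairs, one per non-empty block.
    """
    cleaned = NOISE_RE.sub("", page)
    blocks = []
    for fragment in BLOCK_TAG_RE.split(cleaned):
        text = visible_text(fragment)
        if text:
            link_chars = sum(
                len(visible_text(a)) for a in ANCHOR_RE.findall(fragment)
            )
            blocks.append((text, link_chars))
    return blocks


def is_text_block(text, link_chars, min_chars=20, max_link_ratio=0.3):
    """Assumed rule: long enough and not dominated by anchor text."""
    return len(text) >= min_chars and link_chars / len(text) <= max_link_ratio


def extract_content(page, max_gap=2):
    """Keep the longest run of text blocks; a run ends once more than
    max_gap noise (link or short) blocks appear in a row."""
    best, run, gap = [], [], 0
    for text, link_chars in split_blocks(page):
        if is_text_block(text, link_chars):
            run.append(text)
            gap = 0
        else:
            gap += 1
            if gap > max_gap:
                run = []
        if sum(map(len, run)) > sum(map(len, best)):
            best = list(run)
    return "\n".join(best)
```

Calling `extract_content(page)` on an HTML string returns the retained text blocks joined by newlines. Because short blocks are treated as noise in this sketch, short subheadings inside an article are dropped as well; a fuller implementation would need a more careful rule for them.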
