Computer Engineering ›› 2010, Vol. 36 ›› Issue (06): 102-104. doi: 10.3969/j.issn.1000-3428.2010.06.034

• Software Technology and Database •

Web Page Main Text Extraction Based on Content Similarity

WANG Li1, LIU Zong-tian1, WANG Yan-hua2, LIAO Tao1

  1. School of Computer Science and Engineering, Shanghai University, Shanghai 200072; 2. School of Information Technology, Shanghai Fisheries University, Shanghai 201306
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2010-03-20 Published: 2010-03-20

Abstract: This paper proposes a method that simplifies complex Web page markup and maps it onto a tree structure that is easy to manipulate. The method does not depend on the DOM tree and does not require the HTMLparser package for parsing. Instead, it relies on text similarity: it computes the similarity between the text content of each tree node and the headings at each level, and uses that score to decide whether a small block of text is useful. The page is then cleaned and its main text extracted on that basis. Experimental results show that the method achieves relatively high generality and accuracy in main text extraction.

Key words: Web page main text extraction, Web page mapping, Web page cleaning, text similarity
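
The idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the tree mapping or the similarity measure, so the sketch assumes a flat list of text blocks found by regular expressions (no DOM parser), cosine similarity over character bigrams, and a hypothetical threshold of 0.2. All function names are illustrative.

```python
import math
import re
from collections import Counter

# Assumed block and heading tags; the paper maps the page to a tree instead.
BLOCK_RE = re.compile(r"<(p|div|li|td)\b[^>]*>(.*?)</\1>", re.S | re.I)
HEADING_RE = re.compile(r"<(title|h[1-6])\b[^>]*>(.*?)</\1>", re.S | re.I)
TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(fragment):
    """Remove remaining tags and collapse whitespace."""
    return re.sub(r"\s+", " ", TAG_RE.sub(" ", fragment)).strip()


def bigrams(text):
    """Character-bigram counts (an assumed representation, language-neutral)."""
    chars = re.sub(r"\s+", "", text.lower())
    return Counter(chars[i:i + 2] for i in range(len(chars) - 1))


def cosine(a, b):
    """Cosine similarity between two bigram count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def extract_main_text(html, threshold=0.2):
    """Keep blocks whose best similarity to any heading reaches the threshold.

    Nested block tags are not handled; this sketch assumes flat markup.
    """
    html = re.sub(r"<(script|style)\b.*?</\1\s*>", " ", html, flags=re.S | re.I)
    headings = [bigrams(strip_tags(m.group(2)))
                for m in HEADING_RE.finditer(html)]
    kept = []
    for m in BLOCK_RE.finditer(html):
        text = strip_tags(m.group(2))
        vec = bigrams(text)
        score = max((cosine(vec, h) for h in headings), default=0.0)
        if score >= threshold:
            kept.append(text)
    return kept
```

Calling `extract_main_text(page_source)` returns the text blocks judged relevant to the page's headings. In practice the threshold would need tuning per site, and a word-level measure may suit some languages better than character bigrams.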
