
Computer Engineering, 2011, Vol. 37, Issue 23: 276-278. doi: 10.3969/j.issn.1000-3428.2011.23.093

• Development Research and Design Technology •

Webpage Main Text Localization Based on Image and Text Effective Information Content

LIANG Zheng-you, OU Jie, YU Min-min

  1. (School of Computer, Electronics and Information, Guangxi University, Nanning 530004, China)
  • Received: 2011-06-17 Online: 2011-12-05 Published: 2011-12-05
  • About the authors: LIANG Zheng-you (born 1968), male, professor, Ph.D.; main research interests: information retrieval, distributed computing. OU Jie and YU Min-min, master's degree
  • Funding:
    Guangxi Natural Science Foundation (Grant No. 桂科自0832059)

Abstract: Among existing webpage extraction techniques, main text localization methods consider only a page's text information. When the main text contains abundant image information and little text, these methods tend to drift and their localization accuracy is low. To address this problem, this paper takes an information-theoretic approach that combines the text and image information in a webpage: it designs a method for estimating the information content and the effective information content of the images in a webpage and, on that basis, proposes a main text localization algorithm based on image and text information content. Experimental results show that the algorithm achieves high localization accuracy across different amounts of main-body text.

Key words: main text localization, minimal main text sub-tree, effective information ratio, webpage, image and text
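
The abstract does not give the paper's formulas, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes a simplified DOM tree, a hypothetical fixed information estimate per text character and per image pixel (`BITS_PER_CHAR`, `BITS_PER_PIXEL`), and a hypothetical threshold on the share of page information that a subtree must hold. Under those assumptions, it descends to the smallest subtree that still holds most of the page's combined text and image information, and shows how ignoring images can prevent the search from narrowing down on an image-heavy page.

```python
from dataclasses import dataclass, field
from typing import List

# Assumed constants for illustration only; the paper's estimates are not
# given in the abstract.
BITS_PER_CHAR = 8.0
BITS_PER_PIXEL = 0.01


@dataclass
class Node:
    """A simplified DOM node (hypothetical structure)."""
    tag: str
    text_chars: int = 0      # visible text characters directly in this node
    image_pixels: int = 0    # pixel area of images directly in this node
    children: List["Node"] = field(default_factory=list)


def effective_info(node: Node, bits_per_pixel: float = BITS_PER_PIXEL) -> float:
    """Estimated information of a subtree: text plus images, summed recursively."""
    own = node.text_chars * BITS_PER_CHAR + node.image_pixels * bits_per_pixel
    return own + sum(effective_info(c, bits_per_pixel) for c in node.children)


def locate_main_text(root: Node, threshold: float = 0.7,
                     bits_per_pixel: float = BITS_PER_PIXEL) -> Node:
    """Descend while one child still holds at least `threshold` of the page total."""
    total = effective_info(root, bits_per_pixel)
    if total == 0:
        return root
    node = root
    while True:
        best = max(node.children,
                   key=lambda c: effective_info(c, bits_per_pixel),
                   default=None)
        if best is None or effective_info(best, bits_per_pixel) / total < threshold:
            return node
        node = best


# An image-heavy page: a text-only navigation bar and an article that is
# mostly one large picture with a short caption.
page = Node("body", children=[
    Node("nav", text_chars=150),
    Node("article", children=[
        Node("p", text_chars=100),
        Node("img", image_pixels=640 * 480),
    ]),
])

print(locate_main_text(page).tag)                    # text and images counted
print(locate_main_text(page, bits_per_pixel=0).tag)  # images ignored
```

With images counted, the search settles on the `article` subtree; with `bits_per_pixel=0` no child reaches the threshold, so the search stays at `body`. The constants and the threshold are placeholders chosen for the example and would need to be replaced by the estimates defined in the full paper.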
