
Computer Engineering ›› 2009, Vol. 35 ›› Issue (3): 89-90,9. doi: 10.3969/j.issn.1000-3428.2009.03.031

• Software Technology and Database •

Web Information Extraction Based on Sub-tree Breadth

WANG Quan, SHI Shao-ting   

  1. (Institute of Science & Technology Information of Gansu, Lanzhou 730000)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-02-05 Published:2009-02-05

Abstract: This paper proposes a new Web information extraction method that uses the breadth of a sub-tree to extract page information automatically from different scientific and technical literature Web sites, without treating each site differently. Extraction experiments were carried out on a large number of pages from such sites, and the method has been applied to the Gansu science and technology document sharing platform. The experimental results show that the method extracts the relevant information automatically, regardless of which Web site a page comes from, and achieves high recall and precision.

Key words: sub-tree breadth, information extraction, cross-search
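
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one plausible reading of "sub-tree breadth", not the authors' implementation. It assumes that the breadth of a sub-tree is the number of direct child elements of its root, and that the result list on a literature page is the sub-tree with the largest breadth. The names `Node`, `TreeBuilder`, `widest_subtree` and `extract_records` are hypothetical and introduced here for illustration.

```python
from html.parser import HTMLParser

# Elements that never have children or a closing tag in HTML.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of a simplified DOM tree (hypothetical helper)."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.content = []  # child Nodes and text strings, in document order

    @property
    def children(self):
        return [c for c in self.content if isinstance(c, Node)]


class TreeBuilder(HTMLParser):
    """Builds a Node tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.content.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.current.content.append(text)


def iter_nodes(node):
    """Yield the node and all of its descendants, depth first."""
    yield node
    for child in node.children:
        yield from iter_nodes(child)


def breadth(node):
    """Assumed definition: number of direct child elements."""
    return len(node.children)


def widest_subtree(root):
    """Return the node whose sub-tree has the largest breadth."""
    return max(iter_nodes(root), key=breadth)


def node_text(node):
    """Concatenate all text below a node, in document order."""
    parts = [c if isinstance(c, str) else node_text(c) for c in node.content]
    return " ".join(p for p in parts if p)


def extract_records(html):
    """Treat each child of the widest sub-tree as one extracted record."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = widest_subtree(builder.root)
    return [node_text(child) for child in region.children]


if __name__ == "__main__":
    page = """
    <html><body>
      <div><a>Home</a><a>About</a></div>
      <ul>
        <li>Paper A</li><li>Paper B</li><li>Paper C</li><li>Paper D</li>
      </ul>
      <p>footer</p>
    </body></html>
    """
    for record in extract_records(page):
        print(record)
```

Because the heuristic looks only at tree shape and not at site-specific tag names or class attributes, the same function can be run unchanged on pages from different sites, which is the property the abstract emphasizes. A real system would likely need extra rules, for example to break ties or to skip wide navigation menus.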
