
Computer Engineering ›› 2009, Vol. 35 ›› Issue (3): 89-90,9. doi: 10.3969/j.issn.1000-3428.2009.03.031

• Software Technology and Database •

Web Information Extraction Based on Sub-tree Breadth

WANG Quan, SHI Shao-ting   

  1. (Institute of Science & Technology Information of Gansu, Lanzhou 730000)
  • Received:1900-01-01 Revised:1900-01-01 Online:2009-02-05 Published:2009-02-05

基于子树广度的Web信息抽取

王 权,施韶亭   

  1. (甘肃省科学技术情报研究所,兰州 730000)

Abstract: This paper proposes a new method, based on the breadth of a sub-tree, that automatically extracts useful information from the pages of different scientific and technical document Web sites without distinguishing between them. The method was evaluated experimentally on a large number of Web pages from different document Web sites and has been applied successfully to the Gansu science and technology document sharing platform. The experimental results show that the method extracts the relevant information automatically, regardless of which Web site a page comes from, and achieves high recall and precision.

Key words: sub-tree breadth, information extraction, cross-database search
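
The abstract does not define sub-tree breadth or say how it drives the extraction, so the following is only a minimal sketch of one plausible reading, not the authors' algorithm. It assumes that the breadth of a sub-tree is the number of direct child elements of its root, and that the widest sub-tree in a page's tag tree is the record list (for example, a list of search results). The names `Node`, `TreeBuilder`, `widest_subtree`, `extract_records`, and `SAMPLE_PAGE` are hypothetical and introduced purely for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    """One element of the tag tree; items holds child Nodes and text in document order."""

    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.items = []

    @property
    def children(self):
        return [item for item in self.items if isinstance(item, Node)]

    def breadth(self):
        # Assumption: breadth = number of direct child elements.
        return len(self.children)

    def text(self):
        parts = [item.text() if isinstance(item, Node) else item
                 for item in self.items]
        return " ".join(part for part in parts if part)


class TreeBuilder(HTMLParser):
    """Builds a simple tag tree from HTML using only the standard library."""

    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.items.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        data = data.strip()
        if data:
            self.current.items.append(data)


def widest_subtree(root):
    """Return the node with the largest breadth anywhere in the tree."""
    best = root
    stack = [root]
    while stack:
        node = stack.pop()
        if node.breadth() > best.breadth():
            best = node
        stack.extend(node.children)
    return best


def extract_records(html):
    """Treat each child of the widest sub-tree as one extracted record."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    region = widest_subtree(builder.root)
    return [child.text() for child in region.children]


SAMPLE_PAGE = """
<html><body>
  <div id="nav"><a href="/">Home</a><a href="/help">Help</a></div>
  <ul>
    <li><a href="/1">Paper A</a> <span>2007</span></li>
    <li><a href="/2">Paper B</a> <span>2008</span></li>
    <li><a href="/3">Paper C</a> <span>2009</span></li>
  </ul>
</body></html>
"""

if __name__ == "__main__":
    for record in extract_records(SAMPLE_PAGE):
        print(record)
```

Because the selection rule looks only at the shape of the tag tree and not at site-specific tag names or templates, the same function can be run unchanged on pages from different sites, which is the property the abstract emphasizes. A real page would likely need extra handling, for instance for navigation menus that are wider than the result list.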

