Computer Engineering, 2008, Vol. 34, Issue 24: 58-60. doi: 10.3969/j.issn.1000-3428.2008.24.020

• Software Technology and Database •

Tibetan Web Information Extraction Based on DOM Pruning

Zhu Jie, Ngodrup, GeSang Dorje   

  1. (Department of Computer Science and Technology, Tibet University, Lhasa 850000)
  • Online: 2008-12-20 Published: 2008-12-20

Abstract: With the spread of the Internet and the continuing development of Tibetan information technology, a large number of Tibetan-language websites have appeared. This paper identifies Tibetan web pages by the characteristics of the Tibetan syllable dot and crawls them. After building a DOM tree for each page, it analyzes the relevance of the page's link text and non-link text to the topical information blocks, then extracts the Tibetan topical information with a semantic pruning algorithm. Tests confirm that the algorithm adapts well to both Tibetan web page identification and Tibetan topical information extraction.

Key words: syllable dot, DOM tree, Tibetan, Web information extraction
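
The two steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: the abstract does not give the relevance measure or any thresholds, so the sketch substitutes a simple link-density heuristic for the semantic pruning step and uses the share of syllable dots (U+0F0B, the Tibetan tsheg) as the page-identification signal. All function names, the class names, and the values `0.1` and `0.5` are hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

TSHEG = "\u0f0b"  # TIBETAN MARK INTERSYLLABIC TSHEG, the syllable dot


def tsheg_ratio(text):
    """Fraction of non-whitespace characters that are syllable dots."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c == TSHEG for c in chars) / len(chars)


def looks_tibetan(text, threshold=0.1):
    # The threshold is an illustrative assumption, not a value from the paper.
    return tsheg_ratio(text) >= threshold


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []  # child Node objects and plain str text chunks


class TreeBuilder(HTMLParser):
    """Builds a very small DOM-like tree; enough for the pruning sketch."""
    VOID = {"br", "img", "hr", "meta", "link", "input"}
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        if tag in self.VOID:
            return  # void elements never hold text, so they are ignored
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag in self.SKIP:
            return  # script and style bodies are not page text
        if data.strip():
            self.current.children.append(data.strip())


def text_len(node, inside_link=False):
    """Return (total_chars, link_chars) for the subtree rooted at node."""
    total = link = 0
    for child in node.children:
        if isinstance(child, str):
            total += len(child)
            link += len(child) if inside_link else 0
        else:
            t, l = text_len(child, inside_link or child.tag == "a")
            total += t
            link += l
    return total, link


def prune(node, max_link_ratio=0.5):
    """Bottom-up: drop subtrees that are empty or dominated by link text."""
    kept = []
    for child in node.children:
        if isinstance(child, str) or child.tag == "a":
            kept.append(child)  # a link's fate is decided by its container
            continue
        prune(child, max_link_ratio)
        total, link = text_len(child)
        if total > 0 and link / total <= max_link_ratio:
            kept.append(child)
    node.children = kept


def collect_text(node):
    parts = [c if isinstance(c, str) else collect_text(c) for c in node.children]
    return " ".join(p for p in parts if p)


def extract_topic_text(html, max_link_ratio=0.5):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    prune(builder.root, max_link_ratio)
    return collect_text(builder.root)
```

In this sketch a crawler would call `looks_tibetan` on a page's visible text to decide whether to keep the page, then call `extract_topic_text` on its HTML. Pruning runs bottom-up so that a navigation list is removed before its parent container is scored; otherwise a link-heavy menu could cause the whole page body to be discarded.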
