
Computer Engineering ›› 2008, Vol. 34 ›› Issue (15): 95-96,1. doi: 10.3969/j.issn.1000-3428.2008.15.033

• Software Technology and Database •

Implementation of XML Website Complete Text Search Engine Based on Nutch

WU Min-qi, DING Yue-wei

  1. (College of Computer Engineering, University of Shanghai for Science and Technology, Shanghai 200093)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2008-08-05 Published: 2008-08-05

Abstract: The crawlers of general-purpose search engines understand only common HTML tags and cannot effectively parse the content of XML Web sites. This paper sets up a pure XML Web site containing dynamic custom tags and proposes a scheme that uses XSL style information to help the crawler understand the meaning of the tags in XML pages. Based on this scheme, a full-text search engine for XML Web sites is implemented on top of Nutch.

Key words: XML information retrieval, eXtensible Stylesheet Language Transformations (XSLT), search engine based on Nutch
