Abstract: The crawlers of general-purpose search engines understand only common HTML tags and cannot effectively parse the content of XML Web sites. This paper builds a pure XML Web site containing dynamic, user-defined tags, proposes using XSL style information to help crawlers understand the meaning of the tags in XML pages, and implements a Nutch-based full-text search engine for XML Web sites.
Key words:
XML information retrieval,
eXtensible Stylesheet Language Transformations (XSLT),
Nutch-based search engine
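The abstract does not reproduce the implementation, so the following is only a minimal sketch of the general idea it names: applying an XSL stylesheet to an XML page so that custom tags are mapped to HTML that an ordinary parser can handle. It uses the standard `javax.xml.transform` API from the JDK. The class name `XslTextExtractor`, the sample tags `bookName` and `writer`, and the stylesheet itself are hypothetical and are not taken from the paper or from Nutch.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical helper for illustration; not the paper's code and not part of Nutch. */
public class XslTextExtractor {

    /** Applies the XSL stylesheet to the XML page and returns the transformed markup. */
    public static String transform(String xml, String xsl) {
        try {
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer transformer =
                factory.newTransformer(new StreamSource(new StringReader(xsl)));
            StringWriter out = new StringWriter();
            transformer.transform(
                new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            // Surface stylesheet or input errors as an unchecked exception.
            throw new IllegalArgumentException("XSL transformation failed", e);
        }
    }

    public static void main(String[] args) {
        // Assumed sample page with custom tags that an HTML-only parser would not understand.
        String xml = "<book><bookName>Lucene in Action</bookName>"
                   + "<writer>Gospodnetic</writer></book>";
        // Assumed stylesheet mapping the custom tags onto ordinary HTML elements.
        String xsl =
            "<xsl:stylesheet version=\"1.0\" "
          + "xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
          + "<xsl:output method=\"html\"/>"
          + "<xsl:template match=\"/book\">"
          + "<html><head><title><xsl:value-of select=\"bookName\"/></title></head>"
          + "<body><p><xsl:value-of select=\"writer\"/></p></body></html>"
          + "</xsl:template>"
          + "</xsl:stylesheet>";
        System.out.println(transform(xml, xsl));
    }
}
```

In a crawler pipeline, the transformed markup (rather than the raw XML) would presumably be what is handed to the HTML parser and indexer; how the paper wires this into Nutch is not stated in the abstract.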
WU Min-qi (吴敏琦); DING Yue-wei (丁岳伟). Implementation of XML Website Complete Text Search Engine Based on Nutch[J]. Computer Engineering (计算机工程), 2008, 34(15): 95-96,1.