Abstract:
This paper presents the preliminary design of a focused crawler based on a context graph. The crawler is built on a set of Naive Bayes classifiers, designed under both the vector space model (VSM) and a probabilistic model so that the two designs can be compared. Within each layer of the context graph, the frontier priority queue is sorted by the cosine similarity between the normalized vector of a downloaded document and the query vector. An approach to classifying the crawl results into a pre-defined category directory is also presented.
Key words:
Focused crawling; Machine learning; Context graph
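Abstract (translated from the Chinese original): This paper introduces the preliminary design of a Web focused crawler based on a context graph. It describes the vector space model used for text learning in the NB classifier, namely the Bernoulli model, and the design of the Naive Bayes classifier. It proposes a simplified design for prioritizing the frontier queue: the cosine similarity between the normalized document vector of a downloaded document and the query vector serves as the ranking criterion for downloaded documents within a layer, so that this ordering can be compared with ranking the documents in each layer's queue by their class likelihood scores. An idea for automatically integrating the crawl results with a topic category directory is also introduced.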
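The frontier ordering described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the class name LayerFrontier, the whitespace tokenizer, and the raw term-frequency weighting are all assumptions made for illustration, since the abstract does not specify them. The sketch only shows how one layer's queue could be ordered by the cosine similarity between a normalized document vector and the query vector.

```python
import heapq
import math
from collections import Counter


def normalize(term_counts):
    """Scale a sparse term-frequency mapping to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in term_counts.values()))
    if norm == 0:
        return {}
    return {term: v / norm for term, v in term_counts.items()}


def cosine(doc_vec, query_vec):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    if len(doc_vec) > len(query_vec):
        doc_vec, query_vec = query_vec, doc_vec  # iterate over the shorter one
    return sum(w * query_vec.get(term, 0.0) for term, w in doc_vec.items())


class LayerFrontier:
    """Hypothetical priority queue for one context-graph layer."""

    def __init__(self, query_text):
        # Assumption: plain lowercase whitespace tokens, raw term counts.
        self.query_vec = normalize(Counter(query_text.lower().split()))
        self._heap = []
        self._counter = 0  # tie-breaker: equal scores pop in insertion order

    def push(self, url, text):
        doc_vec = normalize(Counter(text.lower().split()))
        score = cosine(doc_vec, self.query_vec)
        # heapq is a min-heap, so the score is negated to pop the best first.
        heapq.heappush(self._heap, (-score, self._counter, url))
        self._counter += 1
        return score

    def pop(self):
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


if __name__ == "__main__":
    frontier = LayerFrontier("focused crawler")
    frontier.push("a", "cooking recipes pasta")
    frontier.push("b", "focused crawler design")
    frontier.push("c", "crawler news")
    while frontier:
        print(frontier.pop())
```

In a full crawler there would be one such queue per context-graph layer, and the same interface could hold class likelihood scores instead of cosine scores, which is the comparison the abstract mentions.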
LI Daosheng, ZHAO Qiang. Preliminary Design of A Context-Graph-based Focused Crawler[J]. Computer Engineering, 2006, 32(12): 208-209,228.
李道生,赵 强. 基于语景图的主题爬取器的初步设计[J]. 计算机工程, 2006, 32(12): 208-209,228.