Abstract: This paper improves the topic model by building a co-occurrence word database, which raises the topical relevance and quality of downloaded Web pages. The improved model can also describe the context in which keywords appear, infer user intent, and adjust the ranking of search results. On this basis, an FDC topic crawler system is designed and implemented, which uses an improved topic-sensitive FDC-PageRank algorithm to compute the priority of Web pages. Experiments show that the system performs well.
Key words:
topic crawler,
co-occurrence words,
FDC topic model,
FDC_Topic Sensitive PageRank algorithm
CLC number:
GE Ling; JIANG Zong-li. Research of Co-occurrence Words Search-based Topic Crawler[J]. Computer Engineering (计算机工程), 2010, 36(8): 286-288.