Computer Engineering

• Artificial Intelligence and Recognition Technology

Document Retrieval Algorithm Based on Query Intent Identification and Topic Modeling

YAN Rui, LI Shijun

  1. (School of Computer, Wuhan University, Wuhan 430072, China)
  • Received: 2017-02-20 Online: 2018-03-15 Published: 2018-03-15
  • About the authors: YAN Rui (born 1992), male, master's student, whose main research interests are Web data mining and natural language processing; LI Shijun, professor and doctoral supervisor.
  • Funding:
    National Natural Science Foundation of China (61272109); National Natural Science Foundation of China, Youth Fund (61502350).

Abstract: Conventional search engines return only documents that contain the query keywords and ignore the user's real information need behind the query. To address this problem, document retrieval is treated as a personalized recommendation problem, and a personalized retrieval algorithm based on query intent identification and topic modeling is proposed. First, a Latent Dirichlet Allocation (LDA) topic model is built from the user's search history. When a new query arrives, its latent intent is identified with the topic model of that history, and documents are recommended according to topic relevance. Finally, the KL distance from the query to the documents in the set is calculated and used to sort them, and the resulting personalized document list is returned to the user. Experimental results show that, compared with recommendation algorithms based on collaborative similarity calculation and on user interest clustering, the proposed algorithm provides personalized retrieval more accurately and effectively.

Key words: search engine, query intent, document retrieval, personalized recommendation, topic model, Latent Dirichlet Allocation (LDA), KL distance
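
The final ranking step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the topic mixtures of the query and of each document have already been inferred by an LDA model, it assumes the direction KL(query || document), which the abstract does not specify, and all names and numbers (`kl_divergence`, `rank_documents`, `query_topics`, `doc_topics`) are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same topics.

    Zero entries of q are clamped to eps (an assumed smoothing choice)
    so that the logarithm stays finite.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same number of topics")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute nothing
            total += pi * math.log(pi / max(qi, eps))
    return total


def rank_documents(query_topics, doc_topics):
    """Return document ids sorted by ascending KL(query || document)."""
    scored = [
        (kl_divergence(query_topics, theta), doc_id)
        for doc_id, theta in doc_topics.items()
    ]
    scored.sort()  # smallest divergence first, i.e. closest topic mixture
    return [doc_id for _, doc_id in scored]


# Hypothetical topic mixtures over three topics (illustrative values only).
query_topics = [0.7, 0.2, 0.1]
doc_topics = {
    "d1": [0.1, 0.1, 0.8],
    "d2": [0.6, 0.3, 0.1],
    "d3": [0.3, 0.4, 0.3],
}

ranking = rank_documents(query_topics, doc_topics)
print(ranking)
```

In this toy example the document whose topic mixture is closest to the query's comes first. In a full system the mixtures would come from the LDA model fitted on the user's search history rather than being written by hand.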

CLC number: