Computer Engineering

• Artificial Intelligence and Recognition Technology •

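  • About the authors: XIAO Bao (born 1981), male, lecturer, M.S., main research interests: machine learning and the Semantic Web; LI Pu, Ph.D. candidate; HU Jiaojiao, M.S. candidate; JIANG Yuncheng, professor, Ph.D., doctoral supervisor.
  • Funding:
    National Natural Science Foundation of China (61272066); Basic Ability Improvement Project for Young and Middle-aged Teachers in Guangxi Universities (KY2016LX431); Guangzhou Science and Technology Program Project (2014 J4100031); Qinzhou Scientific Research and Technology Development Program Project (20164407).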

Microblog Semantic Retrieval Based on Latent Semantic and Graph Structure

XIAO Bao 1, LI Pu 2,3, HU Jiaojiao 2, JIANG Yuncheng 2

  1. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535000, China
  2. School of Computer, South China Normal University, Guangzhou 510631, China
  3. Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450000, China
  • Received: 2016-10-28  Online: 2017-06-15  Published: 2017-06-15

Abstract: Microblog posts are short and feature-sparse, and a semantic gap separates them from user queries; these characteristics can reduce the efficiency of semantic retrieval. To address these problems, a semantic retrieval model based on latent semantics and graph structure is proposed, combining text features with knowledge-base semantics. First, the Tversky algorithm is used to measure feature relatedness, with hashtags as the features. Second, a topic model is trained on a Wikipedia corpus with Latent Dirichlet Allocation (LDA), and the topic relatedness of texts mapped onto this model is computed from the Jensen-Shannon divergence (JSD). Third, entities and the graph of relations connecting them are extracted from DBpedia, and SimRank is used to measure the relatedness between entities in the graph. The three relatedness scores are then combined into a final relatedness. Experiments are run on Twitter subsets with short and long queries. The results show that, compared with a method based on linked open data and graph theory, the proposed model improves MAP, P@30, and R-Prec by 2.98%, 6.40%, and 5.16%, respectively, indicating better retrieval performance.
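The three relatedness measures named in the abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' implementation: the symmetric Tversky weights (alpha = beta = 0.5), the conversion of JSD into a relatedness as 1 - JSD, the SimRank decay factor of 0.8, the equal-weight linear combination, and all function names and toy inputs are hypothetical, since the abstract does not specify them.

```python
import math
from itertools import product


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index between two hashtag sets (alpha, beta are assumed values)."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def js_divergence(p, q):
    """Jensen-Shannon divergence in base 2, so the result lies in [0, 1]."""
    def kl(x, y):
        # Terms with x_i == 0 contribute nothing to the KL sum.
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)

    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def simrank(in_neighbors, decay=0.8, iterations=10):
    """Iterative SimRank; in_neighbors maps every node to its in-neighbour list."""
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            nu, nv = in_neighbors[u], in_neighbors[v]
            if u == v:
                new[(u, v)] = 1.0
            elif not nu or not nv:
                new[(u, v)] = 0.0
            else:
                total = sum(sim[(x, y)] for x, y in product(nu, nv))
                new[(u, v)] = decay * total / (len(nu) * len(nv))
        sim = new
    return sim


def combined_relatedness(hashtag_rel, topic_rel, entity_rel,
                         weights=(1 / 3, 1 / 3, 1 / 3)):
    """Assumed linear combination of the three scores; the weights are hypothetical."""
    w1, w2, w3 = weights
    return w1 * hashtag_rel + w2 * topic_rel + w3 * entity_rel


# Toy inputs, invented for illustration only.
hashtag_rel = tversky({"nba", "finals"}, {"nba", "playoffs"})
topic_rel = 1.0 - js_divergence([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
toy_graph = {"Team": [], "PlayerA": ["Team"], "PlayerB": ["Team"]}
entity_rel = simrank(toy_graph)[("PlayerA", "PlayerB")]
print(combined_relatedness(hashtag_rel, topic_rel, entity_rel))
```

In a fuller pipeline the topic distributions would come from an LDA model trained on Wikipedia and the graph from DBpedia relations; here both are replaced by hand-written toy values so that the sketch stays self-contained.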

Keywords: microblog, text relatedness, graph structure, Latent Dirichlet Allocation (LDA), semantic retrieval

CLC number: