Microblog Semantic Retrieval Based on Latent Semantic and Graph Structure

doi:10.3969/j.issn.1000-3428.2017.06.029

Computer Engineering

XIAO Bao¹, LI Pu²,³, HU Jiaojiao², JIANG Yuncheng²

(1. School of Electronics and Information Engineering, Qinzhou University, Qinzhou, Guangxi 535000, China; 2. School of Computer, South China Normal University, Guangzhou 510631, China; 3. Software Engineering College, Zhengzhou University of Light Industry, Zhengzhou 450000, China)

Received: 2016-10-28  Online: 2017-06-15  Published: 2017-06-15
About the authors: XIAO Bao (born 1981), male, lecturer, M.S., whose main research interests are machine learning and the Semantic Web; LI Pu, Ph.D. candidate; HU Jiaojiao, master's student; JIANG Yuncheng, professor, Ph.D., doctoral supervisor.
Funding:
National Natural Science Foundation of China (61272066); Basic Ability Improvement Project for Young and Middle-aged Teachers in Guangxi Universities (KY2016LX431); Guangzhou Science and Technology Program Project (2014J4100031); Qinzhou Scientific Research and Technology Development Program Project (20164407).

Abstract

Microblog posts are short, their features are sparse, and there is a semantic gap between posts and user queries, all of which reduce semantic retrieval efficiency. To address these problems, a semantic retrieval model based on latent semantics and graph structure is proposed that combines text features with knowledge-base semantics. First, the Tversky algorithm is used to compute feature relatedness with hashtags as features. Second, a topic model is trained on the Wikipedia corpus with Latent Dirichlet Allocation (LDA), texts are mapped onto this model, and their topic relatedness is computed with the Jensen-Shannon divergence (JSD). Third, entities and the graph of their relations are extracted from DBpedia, and the SimRank algorithm is used to compute the relatedness between entities in the graph. The three relatedness scores are then combined into a final relatedness score. Experiments on a Twitter subset with short and long queries show that, compared with the method based on linked open data and graph theory, the proposed model improves MAP, P@30, and R-Prec by 2.98%, 6.40%, and 5.16%, respectively, which indicates better retrieval performance.

Key words: microblog, text relatedness, graph structure, Latent Dirichlet Allocation (LDA), semantic retrieval
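The three relatedness signals named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the Tversky weights, the SimRank decay factor, or the formula used to combine the three scores, so `alpha`, `beta`, `c`, `iterations`, the equal `weights`, the `1 - JSD` conversion to a relatedness score, and the toy entity names are all assumptions made for illustration.

```python
import math
from itertools import product


def tversky(a, b, alpha=0.5, beta=0.5):
    """Tversky index between two feature sets (here: hashtag sets).
    alpha and beta are assumed values; 0.5/0.5 gives the Dice coefficient."""
    a, b = set(a), set(b)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def jsd(p, q):
    """Jensen-Shannon divergence (log base 2, so the value lies in [0, 1])
    between two topic distributions of equal length."""
    def kl(x, y):
        return sum(xi * math.log2(xi / yi) for xi, yi in zip(x, y) if xi > 0)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def simrank(in_neighbors, c=0.8, iterations=10):
    """Naive iterative SimRank. in_neighbors maps every node to the list of
    its in-neighbours; c is the decay factor (assumed value)."""
    nodes = list(in_neighbors)
    sim = {(u, v): 1.0 if u == v else 0.0 for u, v in product(nodes, nodes)}
    for _ in range(iterations):
        new = {}
        for u, v in product(nodes, nodes):
            iu, iv = in_neighbors[u], in_neighbors[v]
            if u == v:
                new[(u, v)] = 1.0
            elif not iu or not iv:
                new[(u, v)] = 0.0
            else:
                total = sum(sim[(x, y)] for x in iu for y in iv)
                new[(u, v)] = c * total / (len(iu) * len(iv))
        sim = new
    return sim


def combined_relatedness(hashtag_rel, topic_rel, entity_rel,
                         weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted sum of the three scores. The combination rule is an
    assumption; the abstract only says the three results are combined."""
    w1, w2, w3 = weights
    return w1 * hashtag_rel + w2 * topic_rel + w3 * entity_rel


# Toy inputs (hypothetical data, not from the paper).
hashtag_rel = tversky({"ai", "nlp"}, {"nlp", "ml"})
topic_rel = 1.0 - jsd([0.7, 0.2, 0.1], [0.6, 0.3, 0.1])
graph = {"Univ": [], "ProfA": ["Univ"], "ProfB": ["Univ"]}
entity_rel = simrank(graph)[("ProfA", "ProfB")]
print(combined_relatedness(hashtag_rel, topic_rel, entity_rel))
```

In practice the topic distributions would come from an LDA model trained on Wikipedia and the graph from DBpedia relations; the hand-written lists and dictionary above only stand in for those inputs.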
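The evaluation measures reported in the abstract (MAP, P@30, R-Prec) follow standard information-retrieval definitions, sketched below in Python. The function names and the toy ranking are hypothetical, and the paper's own evaluation tooling is not described in the abstract.

```python
def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked documents that are relevant (P@k)."""
    return sum(1 for doc in ranked[:k] if doc in relevant) / k


def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    r = len(relevant)
    return precision_at_k(ranked, relevant, r) if r else 0.0


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant document appears."""
    hits, total = 0, 0.0
    for rank, doc in enumerate(ranked, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(runs):
    """MAP over a list of (ranked_list, relevant_set) pairs, one per query."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


# Toy query (hypothetical data): relevant documents are retrieved at ranks 1 and 3.
ranked = ["d1", "d2", "d3", "d4"]
relevant = {"d1", "d3"}
print(precision_at_k(ranked, relevant, 2))   # 0.5
print(r_precision(ranked, relevant))         # 0.5
print(average_precision(ranked, relevant))   # (1/1 + 2/3) / 2
```

P@30 in the abstract corresponds to `precision_at_k` with `k=30`.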

CLC number: TP18

XIAO Bao, LI Pu, HU Jiaojiao, JIANG Yuncheng. Microblog Semantic Retrieval Based on Latent Semantic and Graph Structure[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2017/V43/I6/182
