Federated Knowledge Graph Query Based on Graph Structure Feature Sampling Data Summary

doi:10.19678/j.issn.1000-3428.0063640

Computer Engineering, 2023, Vol. 49, Issue (1): 73-81.

• Artificial Intelligence and Pattern Recognition •

GAO Feng^1,2,3,4, LI Qiu^1,2,3,4, GU Jinguang^1,2,3,4

1. School of Computer Science and Technology, Wuhan University of Science and Technology, Wuhan 430065, China;
2. Hubei Province Key Laboratory of Intelligent Information Processing and Real-time Industrial System, Wuhan 430065, China;
3. Big Data Science and Engineering Research Institute, Wuhan University of Science and Technology, Wuhan 430065, China;
4. Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service, National Press and Publication Administration, Beijing 100083, China

Received: 2021-12-28; Revised: 2022-02-01; Published: 2022-03-22

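About the authors: GAO Feng (born 1986), male, lecturer, Ph.D.; main research interests are knowledge graphs, intelligent information processing, and the Semantic Web. LI Qiu, master's student. GU Jinguang, professor, Ph.D.
Supported by:
National Science and Technology Innovation 2030 "New Generation Artificial Intelligence" Major Project (2020AAA0108500); National Natural Science Foundation of China (U1836118); Open Fund of the Key Laboratory of Rich Media Digital Publishing Content Organization and Knowledge Service (ZD2021-11/01).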

Abstract

Abstract: A federated system processes SPARQL queries by constructing an effective query plan to guide query execution. The data summary index file captures the structural and semantic information of Resource Description Framework (RDF) datasets, which is essential for estimating the cardinality of subqueries during query plan generation. Existing data summary generation methods must remotely traverse the complete data of each source, which is costly, and in most environments a federated system cannot complete the statistics for large datasets. To address this, this study proposes a method that generates an approximate data summary of the original graph from a sample graph. The aim is to capture count information that is as close to the true values as possible while reducing the generation time and memory overhead of the data summary index file. Specifically, the method first applies a sampling method weighted by the out-degree feature of the RDF graph to obtain a representative sample of the original graph. Next, an improved mapping function maps the information in the sample graph onto the original graph to generate its approximate data summary; this step takes the distribution deviation of subjects and predicates into account. Experimental results show that, compared with the baseline method, the proposed method reduces the generation time of the data summary index file by at least 70%. In addition, an approximate data summary generated from a sample of only 0.5% of the original graph remains highly consistent with the baseline method in query accuracy.

Key words: data summary, data source index, RDF graph sampling, federated query, query performance
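
The two steps named in the abstract (degree-weighted sampling, then mapping sample counts back to the original graph) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `sample_by_out_degree` and `approximate_predicate_counts` are hypothetical, the weighted draw uses the Efraimidis-Spirakis key trick as a stand-in for the paper's sampling method, and a plain linear scale-up stands in for the paper's improved mapping function, which also accounts for the distribution deviation of subjects and predicates.

```python
import random
from collections import Counter, defaultdict


def sample_by_out_degree(triples, fraction, seed=0):
    """Return a sample of triples, favouring subjects with high out-degree.

    Assumption: subjects are ranked by the Efraimidis-Spirakis key
    u ** (1 / weight), with weight = out-degree, and all triples of the
    top-ranked subjects are kept until the triple budget is reached.
    """
    rng = random.Random(seed)
    by_subject = defaultdict(list)
    for s, p, o in triples:
        by_subject[s].append((s, p, o))

    budget = max(1, int(len(triples) * fraction))
    # Larger out-degree pushes the key towards 1, so the subject ranks earlier.
    ranked = sorted(
        by_subject,
        key=lambda s: rng.random() ** (1.0 / len(by_subject[s])),
        reverse=True,
    )

    sample = []
    for s in ranked:
        if len(sample) >= budget:
            break
        sample.extend(by_subject[s])
    return sample


def approximate_predicate_counts(sample, total_triples):
    """Scale per-predicate triple counts from the sample to the full graph.

    Assumption: a uniform scale factor; the paper's mapping function is
    more refined than this.
    """
    scale = total_triples / len(sample)
    counts = Counter(p for _, p, _ in sample)
    return {p: round(c * scale) for p, c in counts.items()}
```

A caller would pass the list of `(subject, predicate, object)` tuples together with a sampling fraction such as `0.005`, then feed the resulting sample and the known size of the original graph to `approximate_predicate_counts` to obtain estimated per-predicate cardinalities for query planning.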

CLC Number: TP319

GAO Feng, LI Qiu, GU Jinguang. Federated Knowledge Graph Query Based on Graph Structure Feature Sampling Data Summary[J]. Computer Engineering, 2023, 49(1): 73-81.

URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0063640

http://www.ecice06.com/EN/Y2023/V49/I1/73
