A Hybrid Storage Scheme for Large-scale Knowledge Graphs

doi:10.19678/j.issn.1000-3428.0252059

Abstract

Knowledge graphs are a key form of data organization in artificial intelligence and, with the rapid development of big data and large models, are now applied across many domains. As knowledge graphs grow, existing storage structures suffer from slow data ingestion and high storage space consumption. To address these problems, this paper proposes a hybrid relational + key-value storage scheme together with an entity clustering algorithm based on attribute frequency. The algorithm groups entities into clusters according to the frequency of their attributes. High-frequency attributes are stored in a relational database, which retains its high query efficiency; rare attributes are stored as key-value pairs, which handle sparse data flexibly. This design avoids the large number of NULL values that relational storage produces on sparse data and reduces the repeated storage of keys in key-value storage, improving storage efficiency without compromising data flexibility. Experiments on synthetic and real-world datasets show that, compared with existing schemes, the proposed scheme saves more than 50% of storage space on real-world datasets and increases data ingestion speed by an order of magnitude, with no significant impact on query performance. These results indicate that the scheme effectively addresses the storage challenges of large-scale knowledge graphs and provides storage support for their application across domains.

Keywords: knowledge graph, Resource Description Framework (RDF) graph, property graph, relational database, data storage
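The relational + key-value split described in the abstract can be illustrated with a minimal sketch. It is not the paper's entity clustering algorithm, which the abstract does not specify. As a simplification it applies one global frequency threshold to all entities; the function name `split_by_attribute_frequency`, the `threshold` parameter, and the sample data are hypothetical.

```python
from collections import Counter


def split_by_attribute_frequency(entities, threshold=0.5):
    """Split entity attributes into relational columns and key-value pairs.

    Assumption (not from the paper): an attribute is "high-frequency" when the
    fraction of entities carrying it is at least `threshold`.
    """
    n = len(entities)
    # Count how many entities carry each attribute.
    counts = Counter(attr for props in entities.values() for attr in props)
    frequent = sorted(a for a, c in counts.items() if c / n >= threshold)
    frequent_set = set(frequent)

    rows = {}      # entity id -> tuple aligned with `frequent` (None stands for NULL)
    kv_pairs = []  # (entity id, attribute, value) triples for rare attributes
    for eid, props in entities.items():
        rows[eid] = tuple(props.get(a) for a in frequent)
        kv_pairs.extend(
            (eid, a, v) for a, v in props.items() if a not in frequent_set
        )
    return frequent, rows, kv_pairs


# Hypothetical sample data: two common attributes and two rare ones.
entities = {
    "e1": {"name": "Alice", "age": 30},
    "e2": {"name": "Bob", "age": 25, "nickname": "bobby"},
    "e3": {"name": "Carol", "age": 41},
    "e4": {"name": "Dave", "award": "Turing"},
}

columns, rows, kv_pairs = split_by_attribute_frequency(entities, threshold=0.5)
null_cells = sum(value is None for row in rows.values() for value in row)
```

In this toy input the relational part keeps only the `age` and `name` columns and contains a single NULL cell, whereas one wide table over all four attributes would contain seven. The two rare attributes become two key-value entries, so attribute names are stored per entry only for sparse data.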

YOU Yiheng, WANG Xin, MA Menglu, WANG Hui. A Hybrid Storage Scheme for Large-scale Knowledge Graphs[J]. Computer Engineering, 2025, 51(12): 43-55.

https://www.ecice06.com/CN/Y2025/V51/I12/43

Figures/Tables: 14

Fig.1 An RDF graph example

Fig.2 A property graph example

Fig.3 Overall architecture of KGHS

Fig.4 KGHS data model

Fig.5 Results of the ingestion test on the LDBC dataset

Fig.6 Comparison of test results for the 21 queries provided by the LDBC dataset

Fig.7 Results of the ingestion test on the LUBM dataset

Fig.8 Results of the ingestion test on the DBpedia dataset

Fig.9 Comparison of test results for the 14 queries provided by the LUBM8K dataset

Fig.10 Comparison of storage performance under different frequency thresholds on the DBpedia dataset

Fig.11 Comparison of ingestion energy consumption on the DBpedia dataset


