
Computer Engineering, 2025, Vol. 51, Issue (12): 43-55. doi: 10.19678/j.issn.1000-3428.0252059

• Hot Topics and Reviews •

一种面向大规模知识图谱的混合存储方案

游奕桁, 王鑫*, 马梦露, 王惠

  1. 天津大学智能与计算学部, 天津 300350
  • 收稿日期:2025-01-17 修回日期:2025-04-09 出版日期:2025-12-15 发布日期:2025-05-19
  • 通讯作者: 王鑫
  • Funding:
    General Program of the National Natural Science Foundation of China (62472311)

A Hybrid Storage Scheme for Large-scale Knowledge Graphs

YOU Yiheng, WANG Xin*, MA Menglu, WANG Hui

  1. College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
  • Received: 2025-01-17 Revised: 2025-04-09 Published: 2025-12-15 Online: 2025-05-19
  • Contact: WANG Xin

摘要:

知识图谱作为人工智能领域的关键数据组织形式, 在大数据与大模型蓬勃发展的当下, 被广泛应用于众多领域。随着知识图谱规模不断扩大, 现有存储结构暴露出数据导入速度慢、存储空间占用大等问题。为此, 提出一种关系型+键值对的混合存储方案, 并设计基于属性频率的实体聚类算法。该方案借助基于属性频率的实体聚类算法, 对不同属性频率的实体簇进行分类。对于高频属性, 利用关系型数据库存储, 发挥其查询效率高的优势; 对于稀有属性, 采用键值对形式存储, 以展现键值对存储在处理稀疏数据时的灵活性。这种设计有效规避了关系型存储面对稀疏数据时产生大量空值的弊端, 减少了键值对存储中键的重复存储问题, 在确保数据灵活性的同时显著提升了存储效率。在合成数据集和真实数据集上的实验结果显示, 与现有方案相比, 该方案在真实数据集上存储空间节省50%以上, 数据导入速度提升1个数量级, 且查询效率保持不变。这充分说明了该方案有效地解决了大规模知识图谱的存储难题, 为知识图谱在各个领域的广泛应用提供了有力的存储支持, 具有重要的理论意义和实际应用价值。

关键词: 知识图谱, 资源描述框架图, 属性图, 关系型数据库, 数据存储

Abstract:

Knowledge graphs, a key form of data organization in artificial intelligence, are widely applied across many domains amid the rapid development of big data and large models. As knowledge graphs continue to grow, existing storage structures suffer from slow data ingestion and high storage space consumption. To address these issues, this paper proposes a hybrid relational + key-value storage scheme and designs an entity clustering algorithm based on attribute frequency. The algorithm groups entities into clusters according to the frequency of their attributes. High-frequency attributes are stored relationally, which exploits the query efficiency of relational databases, while rare attributes are stored as key-value pairs, which handle sparse data flexibly. This design avoids the large number of NULL values that relational storage produces on sparse data, reduces the repeated storage of keys in key-value storage, and significantly improves storage efficiency without compromising data flexibility. Experiments on synthetic and real-world datasets show that, compared with existing schemes, the proposed scheme saves more than 50% of storage space on real-world datasets and increases data ingestion speed by an order of magnitude, with no significant impact on query performance. These results indicate that the scheme effectively addresses the storage challenges of large-scale knowledge graphs, provides strong storage support for their application across fields, and has both theoretical significance and practical value.

Key words: knowledge graph, Resource Description Framework (RDF) graph, property graph, relational database, data storage
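
The hybrid layout described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the paper clusters entities by attribute frequency, whereas this sketch applies a single global frequency cutoff. The function names, the `threshold` parameter, and the sample entities are all hypothetical, chosen only to show how frequent attributes become fixed relational columns while rare attributes go to key-value triples.

```python
from collections import Counter


def split_by_attribute_frequency(entities, threshold=0.5):
    """Split attributes into frequent (relational) and rare (key-value) parts.

    `threshold` is a hypothetical cutoff: an attribute counts as frequent
    when at least that fraction of entities carries it.
    """
    n = len(entities)
    counts = Counter(attr for props in entities.values() for attr in props)
    frequent = sorted(a for a, c in counts.items() if c / n >= threshold)
    frequent_set = set(frequent)

    rows = {}  # relational part: entity id -> tuple aligned with `frequent`
    kv = []    # key-value part: (entity id, attribute, value) triples
    for eid, props in entities.items():
        rows[eid] = tuple(props.get(a) for a in frequent)
        kv.extend((eid, a, v) for a, v in props.items() if a not in frequent_set)
    return frequent, rows, kv


def null_cells(rows):
    """Count empty cells (None) in the relational part."""
    return sum(v is None for row in rows.values() for v in row)


# Hypothetical sample data: two common attributes, two rare ones.
entities = {
    "e1": {"name": "Alice", "age": 30, "nickname": "Al"},
    "e2": {"name": "Bob", "age": 25},
    "e3": {"name": "Carol", "age": 41, "award": "Turing"},
    "e4": {"name": "Dave"},
}

# Hybrid layout: only attributes present on at least half of the entities
# become relational columns.
frequent, rows, kv = split_by_attribute_frequency(entities, threshold=0.5)

# Purely relational baseline: every attribute becomes a column.
_, wide_rows, _ = split_by_attribute_frequency(entities, threshold=0.0)

print("relational columns:", frequent)
print("key-value triples:", kv)
print("NULL cells, hybrid vs. wide table:", null_cells(rows), null_cells(wide_rows))
```

On this toy input, the rare attributes no longer force a column onto every row, so the relational part holds far fewer NULL cells than a single wide table, while the attribute name is stored per triple only for the rare cases.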