作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2023, Vol. 49 ›› Issue (9): 32-42. doi: 10.19678/j.issn.1000-3428.0065886

• 热点与综述 • 上一篇    下一篇

基于数据仓库的典型图查询处理技术

郭家鼎1, 王鹏2   

  1. 1. 复旦大学 软件学院, 上海 200438
    2. 复旦大学 计算机科学技术学院, 上海 200438
  • 收稿日期:2022-09-30 出版日期:2023-09-15 发布日期:2023-09-14
  • 作者简介:

    郭家鼎(1997—),男,硕士研究生,主研方向为图数据库系统

    王鹏,教授、博士

  • 基金资助:
    国家重点研发计划(2020YFB1710001)

Graph Query Processing Technology Based on Data Warehouse

Jiading GUO1, Peng WANG2   

  1. 1. School of Software, Fudan University, Shanghai 200438, China
    2. School of Computer Science, Fudan University, Shanghai 200438, China
  • Received:2022-09-30 Online:2023-09-15 Published:2023-09-14

摘要:

向量化查询等技术的成熟为基于数据仓库(数仓)实现图查询提供了契机,但现有系统没有考虑列式存储特点和图查询语句的特点,无法高效存储图数据及支持图查询优化。同时,由于需要保持原有图查询应用的兼容性,图查询Gremlin语言翻译生成的数仓SQL语言书写复杂且性能较差。针对上述问题,提出基于数仓的图数据库系统PandaGraph。在存储方面,PandaGraph基于关系模型高效存储图数据,结合数仓列式存储的特性进行主键和属性键设计,同时考虑到图查询和数仓查询执行特点,构建出入两张边表供图查询进行选择。在查询方面,PandaGraph结合不同Gremlin步骤的特点,构建关于遍历和存储表的查询结构,实现从Gremlin语言到SQL语言的翻译转化,使用多种优化规则对生成SQL语句进行改写,提高图查询性能。实验结果表明,PandaGraph在多场景下可正确进行翻译查询工作,并且在经典的低k跳查询场景下较现有专有图数据库系统获得5.8倍性能提升,在高k跳场景下可获得18.5倍性能提升,在基于规则的优化、基于表选择的优化和基于表结构的优化下PandaGraph可获得最少1.3、1.1和1.3倍的性能提升。

关键词: 数据库系统, 关系型数据库, 数据仓库, 图查询, 查询翻译

Abstract:

The maturity of technologies such as vectorization provides an opportunity to realize graph queries based on a data warehouse.However, the existing system does not consider the characteristics of columnar storage and queries and fails to efficiently store data and support query optimization and maintain the compatibility of the original graph query application.The Structured Query Language(SQL) of the data warehouse translated by the Gremlin graph query language is complex and has poor performance.To address these problems, PandaGraph, based on a data warehouse is proposed.For storage, PandaGraph efficiently stores graph data based on the relational model, designs primary and attribute keys by columnar storage, and considers the characteristics of graph query and data warehouse query execution by storing OUT and IN tables.For a query, PandaGraph uses Gremlin steps to construct a query structure for traversing and storage.It then translates the Gremlin language to SQL. Experiments show that PandaGraph can correctly translate queries in many cases and achieves a 5.8 times performance improvement in the classic low k-hop algorithm compared with the existing special graph database system.It achieves 18.5 times improvement in a high k-hop scenario. With rule-based optimization, table selection optimization, and table design optimization, PandaGraph can obtain at least 1.3, 1.1, and 1.3 times improvements, respectively.

Key words: database system, relational database, data warehouse, graph query, query translation