Graph Query Processing Technology Based on Data Warehouse

doi:10.19678/j.issn.1000-3428.0065886

Abstract

Abstract (from the Chinese original):

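The maturing of techniques such as vectorized query execution creates an opportunity to implement graph queries on top of a data warehouse. Existing systems, however, take into account neither the characteristics of columnar storage nor those of graph query statements, so they cannot store graph data efficiently or support graph query optimization. In addition, because compatibility with existing graph query applications must be preserved, the data warehouse SQL generated by translating the Gremlin graph query language is complex to write and performs poorly. To address these problems, this paper proposes PandaGraph, a graph database system built on a data warehouse. For storage, PandaGraph stores graph data efficiently on the relational model, designs its primary keys and attribute keys around the warehouse's columnar storage, and, considering how graph queries and warehouse queries execute, builds two edge tables (outgoing and incoming) from which a graph query can choose. For querying, PandaGraph uses the characteristics of the different Gremlin steps to build a query structure over traversals and storage tables, translates Gremlin into SQL, and rewrites the generated SQL statements with multiple optimization rules to improve graph query performance. Experimental results show that PandaGraph translates and executes queries correctly across multiple scenarios. Compared with an existing dedicated graph database system, it achieves a 5.8-fold performance improvement in the classic low k-hop query scenario and an 18.5-fold improvement in the high k-hop scenario. Rule-based optimization, table-selection optimization, and table-structure optimization yield performance improvements of at least 1.3, 1.1, and 1.3 times, respectively.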

Keywords: database system, relational database, data warehouse, graph query, query translation

Abstract:

The maturity of technologies such as vectorized query execution provides an opportunity to implement graph queries on a data warehouse. However, existing systems consider neither the characteristics of columnar storage nor those of graph query statements, and therefore cannot store graph data efficiently or support graph query optimization. Moreover, because compatibility with existing graph query applications must be maintained, the data warehouse Structured Query Language (SQL) translated from the Gremlin graph query language is complex and performs poorly. To address these problems, PandaGraph, a graph database system based on a data warehouse, is proposed. For storage, PandaGraph efficiently stores graph data based on the relational model, designs primary and attribute keys according to the columnar storage of the data warehouse, and, considering the execution characteristics of graph queries and data warehouse queries, maintains separate OUT and IN edge tables for graph queries to choose between. For queries, PandaGraph uses the characteristics of the different Gremlin steps to construct a query structure over traversals and storage tables, translates Gremlin into SQL, and rewrites the generated SQL with multiple optimization rules to improve graph query performance. Experiments show that PandaGraph translates queries correctly in many scenarios and, compared with an existing dedicated graph database system, achieves a 5.8 times performance improvement in the classic low k-hop query scenario and an 18.5 times improvement in the high k-hop scenario. With rule-based optimization, table selection optimization, and table design optimization, PandaGraph obtains improvements of at least 1.3, 1.1, and 1.3 times, respectively.

Keywords: database system, relational database, data warehouse, graph query, query translation
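The two ideas in the abstract, storing each edge in both an OUT and an IN table and translating a chain of Gremlin steps into a single SQL statement, can be illustrated with a minimal sketch. This is not PandaGraph's implementation. The table names (`edge_out`, `edge_in`), the column names, the key layout, and the `translate` helper are all assumptions made for illustration, and SQLite (a row store) stands in for the data warehouse only so that the generated SQL can be checked.

```python
import sqlite3

# Hypothetical schema (assumption, not PandaGraph's actual tables):
# every edge is stored twice, keyed for each traversal direction.
SCHEMA = """
CREATE TABLE edge_out (src INTEGER, dst INTEGER, label TEXT,
                       PRIMARY KEY (src, label, dst));
CREATE TABLE edge_in  (dst INTEGER, src INTEGER, label TEXT,
                       PRIMARY KEY (dst, label, src));
"""

# direction -> (table to read, column we arrive on, column we leave by)
DIRECTIONS = {
    "out": ("edge_out", "src", "dst"),
    "in": ("edge_in", "dst", "src"),
}


def translate(start_id, steps):
    """Turn steps such as [("out", "knows"), ("in", "knows")] into (sql, params).

    Each step becomes one join against the edge table whose key prefix
    matches the direction of travel (a simple form of table selection).
    """
    if not steps:
        raise ValueError("at least one traversal step is required")
    from_items, where, params = [], [], []
    prev_target = None
    for i, (direction, label) in enumerate(steps):
        if direction not in DIRECTIONS:
            raise ValueError(f"unsupported step: {direction}")
        table, near, far = DIRECTIONS[direction]
        alias = f"e{i}"
        if prev_target is None:
            # First hop: start from the bound vertex id.
            from_items.append(f"{table} AS {alias}")
            where.append(f"{alias}.{near} = ?")
            params.append(start_id)
        else:
            # Later hops: continue from the vertices reached so far.
            from_items.append(
                f"JOIN {table} AS {alias} ON {alias}.{near} = {prev_target}")
        where.append(f"{alias}.label = ?")
        params.append(label)
        prev_target = f"{alias}.{far}"
    sql = (f"SELECT {prev_target} AS id FROM " + " ".join(from_items)
           + " WHERE " + " AND ".join(where))
    return sql, params


def build_demo():
    """Load a four-vertex graph: 1->2, 1->3, 2->4, 3->4, all labeled 'knows'."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    edges = [(1, 2, "knows"), (1, 3, "knows"), (2, 4, "knows"), (3, 4, "knows")]
    conn.executemany(
        "INSERT INTO edge_out (src, dst, label) VALUES (?, ?, ?)", edges)
    conn.executemany(
        "INSERT INTO edge_in (dst, src, label) VALUES (?, ?, ?)",
        [(d, s, l) for s, d, l in edges])
    return conn


def run(conn, start_id, steps):
    """Translate the steps, execute the SQL, and return the reached vertex ids."""
    sql, params = translate(start_id, steps)
    return sorted(row[0] for row in conn.execute(sql, params))


if __name__ == "__main__":
    conn = build_demo()
    # g.V(1).out('knows').out('knows'): a 2-hop query, one row per path
    print(run(conn, 1, [("out", "knows"), ("out", "knows")]))  # [4, 4]
    # g.V(4).in('knows'): served from the IN table
    print(run(conn, 4, [("in", "knows")]))  # [2, 3]
```

The sketch keeps one row per path rather than adding `DISTINCT`, which matches the way a Gremlin traversal without `dedup()` yields one traverser per path. A k-hop query becomes a chain of k joins, which is the kind of generated SQL that rewrite rules could then simplify.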

Jiading GUO, Peng WANG. Graph Query Processing Technology Based on Data Warehouse[J]. Computer Engineering, 2023, 49(9): 32-42.

https://www.ecice06.com/CN/Y2023/V49/I9/32

Figures/Tables 16

Fig.1 PandaGraph system architecture

Fig.2 Vertex table storage

Fig.3 Edge table storage

Fig.4 PandaGraph translation process

Fig.5 Comparison of data import time

Fig.6 Data import time breakdown

