Cardinality Estimation Method for Multitable JOIN Query Optimization

doi:10.19678/j.issn.1000-3428.0061625

Abstract: Cardinality estimation is an important means of optimizing multitable JOIN queries in databases. When estimating cardinality for tables that hold large volumes of data, data sampling is commonly used to obtain smaller samples from which the cardinalities required under various query workloads are estimated. Sampling-based cardinality estimation on a single table has been widely studied; however, when the total storage budget for samples across multiple tables is limited, an effective method for dividing the sample budget among the tables so as to improve overall estimation accuracy is still lacking. This paper therefore proposes a cardinality estimation method for multitable JOIN query optimization. For a given query workload containing complex multi-JOIN operations, the method allocates a sampling rate to each table in the database so that cardinality estimation accuracy is maximized while the limit on the total sample size is satisfied. This process is formulated as a sampling rate allocation search problem, and a Bayesian optimization search algorithm is introduced to solve it. The algorithm quickly searches for the allocation of sample sizes among tables, so that the sample allocation scheme found within a limited time yields the highest cardinality estimation accuracy, thereby achieving query optimization. Experimental results on the TPC-H dataset show that, when both algorithms are given the same amount of time to find the sampling allocation scheme with the highest estimation accuracy under a multi-JOIN query workload, the scheme found by Bayesian optimization has a cardinality estimation error rate 54.8% to 60.2% lower than that of the scheme found by random search.

Key words: multitable JOIN, query optimization, cardinality estimation, data sampling, Bayesian optimization
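
The allocation problem described in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes a single two-table equi-join instead of a full query workload, uses independent Bernoulli sampling, and implements only the random search baseline that the abstract compares against. All function names (`bernoulli_sample`, `estimate_join_size`, `random_allocation`, `random_search`) and the toy data are hypothetical. A Bayesian optimization search would replace the random proposal step with proposals guided by a surrogate model of the error.

```python
import random
from collections import Counter


def bernoulli_sample(keys, rate, rng):
    """Keep each join key independently with probability `rate`."""
    return [k for k in keys if rng.random() < rate]


def join_size(left_keys, right_keys):
    """Exact size of an equi-join on a single key column."""
    right_counts = Counter(right_keys)
    return sum(right_counts[k] for k in left_keys)


def estimate_join_size(left_keys, right_keys, left_rate, right_rate, rng):
    """Join the two samples and scale up by the inverse sampling rates."""
    sample_left = bernoulli_sample(left_keys, left_rate, rng)
    sample_right = bernoulli_sample(right_keys, right_rate, rng)
    return join_size(sample_left, sample_right) / (left_rate * right_rate)


def random_allocation(sizes, budget, rng):
    """Propose per-table sampling rates whose expected total sample
    size does not exceed `budget` rows (assumed form of the constraint)."""
    weights = [rng.uniform(0.05, 1.0) for _ in sizes]
    total = sum(weights)
    # Each table receives a share of the budget; rates are capped at 1.0.
    return [min(1.0, budget * w / total / n) for w, n in zip(weights, sizes)]


def random_search(left_keys, right_keys, budget, trials, seed=0):
    """Random search baseline: try `trials` allocations and keep the one
    with the lowest relative estimation error. Assumes a non-empty join."""
    rng = random.Random(seed)
    truth = join_size(left_keys, right_keys)
    sizes = [len(left_keys), len(right_keys)]
    best_error, best_rates = float("inf"), None
    for _ in range(trials):
        rates = random_allocation(sizes, budget, rng)
        estimate = estimate_join_size(left_keys, right_keys, rates[0], rates[1], rng)
        error = abs(estimate - truth) / truth
        if error < best_error:
            best_error, best_rates = error, rates
    return best_error, best_rates


# Toy data: 50 distinct keys, repeated 40 times on the left and 10 on the right.
left = [i % 50 for i in range(2000)]
right = [i % 50 for i in range(500)]
error, rates = random_search(left, right, budget=250, trials=20)
print(f"best relative error {error:.3f} at rates {rates}")
```

Each trial here scores an allocation with a single sampled estimate, so the measured error is noisy; averaging over several samples or over all queries in a workload would give a steadier objective for any search algorithm.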


CLC Number: TP311

QIAN Wenyuan, JING Yinan, WANG Xiaoyang, WU Zhenhuan. Cardinality Estimation Method for Multitable JOIN Query Optimization[J]. Computer Engineering, 2022, 48(6): 167-173.



URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0061625

http://www.ecice06.com/EN/Y2022/V48/I6/167


