作者投稿和查稿 主编审稿 专家审稿 编委审稿 远程编辑

计算机工程 ›› 2022, Vol. 48 ›› Issue (6): 167-173. doi: 10.19678/j.issn.1000-3428.0061625

• 先进计算与数据处理 • 上一篇    下一篇

面向多表连接查询优化的基数估计方法

钱文渊1, 荆一楠1, 王晓阳1, 吴振环2   

  1. 1. 复旦大学 计算机科学技术学院, 上海 200433;
    2. 司法部 信息中心, 北京 100020
  • 收稿日期:2021-05-12 修回日期:2021-07-19 发布日期:2021-07-21
  • 作者简介:钱文渊(1995—),男,硕士研究生,主研方向为数据库技术;荆一楠,副教授、博士;王晓阳,教授、博士;吴振环,学士。
  • 基金资助:
    国家自然科学基金(61732004)。

Cardinality Estimation Method for Multitable JOIN Query Optimization

QIAN Wenyuan1, JING Yinan1, WANG Xiaoyang1, WU Zhenhuan2   

  1. 1. School of Computer Science, Fudan University, Shanghai 200433, China;
    2. Information Center, Ministry of Justice, Beijing 100020, China
  • Received:2021-05-12 Revised:2021-07-19 Published:2021-07-21

摘要: 基数估计是实现数据库多表连接(JOIN)查询优化的重要手段之一。对数据量较大的数据表进行基数估计时常用数据抽样来获得较小的样本,从而估计各种查询负载下所需的数据基数。在单表上利用数据抽样来完成基数估计的方法已经得到广泛研究,但在多个数据表的抽样样本总体存储预算存在限制时,目前仍缺乏有效的多表间样本数划分方法使得整体基数估计达到较优。为此,提出一种面向多表JOIN查询优化的基数估计方法,针对一组给定的含有复杂多JOIN操作的查询负载,为其合理分配数据库中每个表的抽样率,从而在满足样本大小总和限制的同时使得基数估计准确率达到最高。将上述过程抽象为一个抽样率分配搜索问题,在数据库数据抽样问题中引入贝叶斯优化搜索算法,利用该算法快速搜索出不同表之间抽样样本大小的分配比例,使得有限时间内获得的样本分配方案对应的基数估计准确率最高,从而达到查询优化的目的。在TPC-H数据集上的实验结果表明,在相同时间内确定多JOIN操作查询负载下基数估计准确率最高的抽样比例方案时,相比随机搜索算法,贝叶斯优化算法所得方案对应的基数估计误差率降低54.8%~60.2%。

关键词: 多表连接, 查询优化, 基数估计, 数据抽样, 贝叶斯优化

Abstract: Cardinality estimation is one of the important means to optimize database multitable JOIN queries.When the cardinality of a data table with a large number of data is estimated, data sampling is often used to obtain smaller samples to estimate the data cardinality required under various query loads.The method of using data sampling to complete cardinality estimation on a single table has been widely studied;however, when there are restrictions on the overall storage budget of sampling samples in multiple data tables, an effective sample number division method between multiple tables to improve the overall cardinality estimation is lacking.Therefore, a cardinality estimation method for multitable JOIN query optimization is proposed.For a given set of query loads with complex multiple JOIN operations, the sampling rate of each table in the database is reasonably allocated to maximize the accuracy of cardinality estimation while meeting the limit of the sum of sample sizes.The above process is abstracted as a sampling rate allocation search problem, and the Bayesian optimization search algorithm is introduced into the database data sampling problem.This algorithm is used to search the allocation proportion of sampling sample size between different tables quickly, so that the cardinality estimation accuracy corresponding to the sample formula obtained in a limited time is the highest, thereby achieving query optimization.The experimental results on the TPC-H dataset show that, when determining the sampling proportion scheme with the highest cardinality estimation accuracy under the query load of multiple JOIN operations in the same time, compared with the random search algorithm, the cardinality estimation error rate corresponding to the scheme obtained by Bayesian optimization algorithm is reduced by 54.8% to 60.2%.

Key words: multitable JOIN, query optimization, cardinality estimation, data sampling, Bayesian optimization

中图分类号: