Research on Robust and Adaptive Learned Approximate Query-Processing Method

doi:10.19678/j.issn.1000-3428.0066743

Abstract:

Because exact queries on large-scale datasets incur significant latency, Approximate Query-Processing (AQP) techniques are typically applied in online analytical processing to return query results within interactive timescales with minimal error. Existing learning-based AQP methods are decoupled from the underlying data and convert I/O-intensive computation into CPU-intensive computation. However, because computing resources are limited, model training is typically performed on random data samples. Such samples tend to miss rare populations, which degrades the model's prediction accuracy. This paper therefore proposes a Stratified Sampling-based Sum-Product Network (SSSPN) model and designs an AQP framework based on it. Stratified samples effectively avoid the loss of rare populations and significantly improve model accuracy. In addition, to handle dynamic data updates, this paper proposes an adaptive model-update strategy that allows the model to detect data shifts promptly and update itself automatically. Experimental results show that, compared with sampling-based and machine-learning-based AQP methods, the model's average relative errors on real and synthetic datasets are approximately 18.3% and 2.2% lower, respectively; in scenarios where data are dynamically updated, both the accuracy and the query latency of the model remain stable.

Key words: Approximate Query-Processing (AQP), Sum-Product Networks (SPN), stratified sampling, data shift, adaptive update
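
The abstract's point about rare populations can be illustrated with a small sketch. It is not the paper's SSSPN construction; it only contrasts uniform sampling with a simple capped per-group stratified sample. The table contents, the group names `AA` and `ZZ`, the function names, and the cap of 100 rows per stratum are all hypothetical choices made for illustration.

```python
import random
from collections import Counter, defaultdict


def uniform_sample(rows, n, rng):
    """Draw n rows uniformly at random, without replacement."""
    return rng.sample(rows, n)


def stratified_sample(rows, key, per_group_cap, rng):
    """Keep at most per_group_cap rows from every group (stratum)."""
    strata = defaultdict(list)
    for row in rows:
        strata[key(row)].append(row)
    sample = []
    for group_rows in strata.values():
        k = min(per_group_cap, len(group_rows))
        sample.extend(rng.sample(group_rows, k))
    return sample


rng = random.Random(0)

# Hypothetical skewed table of (group, value) rows: one large group, one rare group.
rows = [("AA", rng.gauss(10, 3)) for _ in range(20_000)]
rows += [("ZZ", rng.gauss(60, 5)) for _ in range(5)]

uni = uniform_sample(rows, 200, rng)
strat = stratified_sample(rows, key=lambda r: r[0], per_group_cap=100, rng=rng)

# Count how many rows of each group survived in each sample.
uni_groups = Counter(group for group, _ in uni)
strat_groups = Counter(group for group, _ in strat)
print("uniform:", dict(uni_groups))
print("stratified:", dict(strat_groups))
```

In the stratified sample every group is guaranteed to be represented, so a model trained on it sees the rare group; a uniform sample of the same order of size will usually contain few or no rows from it.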

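Chinese abstract (English translation):

Because exact queries on large-scale datasets take a long time to execute, Approximate Query-Processing (AQP) techniques are often used in online analytical processing to return query results with short interactive latency while keeping query error as low as possible. Existing learned AQP methods are decoupled from the underlying data and convert I/O-intensive computation into CPU-intensive computation. However, owing to limited computing resources, these methods usually train their models on random data samples; such training data cause rare groups to go missing, which lowers the model's prediction accuracy. To address this problem, a mixed sum-product network model learned from stratified samples is proposed, and an AQP framework is designed based on this model. Stratified samples effectively avoid missing rare groups, and a model trained on them is considerably more accurate. In addition, for dynamically updated data, an adaptive model-update strategy is proposed that enables the model to detect data shift promptly and update itself adaptively. Experimental results show that, compared with sampling-based and machine-learning-based AQP methods, the model reduces the average relative error on real and synthetic datasets by approximately 18.3% and 2.2%, respectively; when data are dynamically updated, both its accuracy and its query latency remain stable.

Key words (English translation): approximate query processing, sum-product network, stratified sampling, data shift, adaptive update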
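
The adaptive update idea (detect a data shift, then trigger a model update) can be sketched as follows. The abstract does not say which detector the paper uses, so this example assumes a two-sample Kolmogorov-Smirnov statistic on a single numeric column; the function names, the synthetic data, and the threshold of 0.2 are hypothetical.

```python
import bisect
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of a and b, evaluated at every observed value."""
    a, b = sorted(a), sorted(b)
    d = 0.0
    for x in a + b:
        cdf_a = bisect.bisect_right(a, x) / len(a)
        cdf_b = bisect.bisect_right(b, x) / len(b)
        d = max(d, abs(cdf_a - cdf_b))
    return d


def needs_update(train_values, new_values, threshold=0.2):
    """Flag a data shift when the statistic exceeds a (hypothetical) threshold."""
    return ks_statistic(train_values, new_values) > threshold


rng = random.Random(1)
train = [rng.gauss(0, 1) for _ in range(500)]    # data the model was trained on
same = [rng.gauss(0, 1) for _ in range(500)]     # new batch, same distribution
shifted = [rng.gauss(2, 1) for _ in range(500)]  # new batch, mean moved by 2

print("same distribution ->", needs_update(train, same))
print("shifted distribution ->", needs_update(train, shifted))
```

A fixed threshold keeps the sketch short; a real system would more likely derive the cut-off from a significance level and the sample sizes, and would retrain or incrementally update only the affected model.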

Yimeng QIAO, Yinan JING, Hanbing ZHANG. Research on Robust and Adaptive Learned Approximate Query-Processing Method[J]. Computer Engineering, 2024, 50(1): 30-38.

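乔艺萌, 荆一楠, 张寒冰. 健壮且自适应的学习型近似查询处理方法研究 (Research on Robust and Adaptive Learned Approximate Query-Processing Method)[J]. 计算机工程 (Computer Engineering), 2024, 50(1): 30-38.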


URL: http://www.ecice06.com/EN/10.19678/j.issn.1000-3428.0066743

http://www.ecice06.com/EN/Y2024/V50/I1/30

Figures/Tables 14

Fig.1 Example of MSPN

Fig.2 SSSPN constructed based on stratified samples of Flights dataset

Fig.3 Approximate query-processing framework based on SSSPN

Fig.4 Prediction accuracy on Flights and SSB datasets

Fig.5 Average query latency on Flights and SSB datasets

Fig.6 Changes in query accuracy during model adaptive update process

Fig.7 Changes in query latency during model adaptive update process

Fig.8 Comparison of query accuracy under different model numbers

Fig.9 Comparison of average query latency under different model numbers

