
Computer Engineering ›› 2024, Vol. 50 ›› Issue (1): 30-38. doi: 10.19678/j.issn.1000-3428.0066743

• Hot Topics and Reviews •

Research on Robust and Adaptive Learned Approximate Query-Processing Method

Yimeng QIAO1,*, Yinan JING2, Hanbing ZHANG2

  1. School of Software, Fudan University, Shanghai 200441, China
  2. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received: 2023-01-11 Online: 2024-01-15 Published: 2024-01-12
  • Contact: Yimeng QIAO
  • Funding: National Natural Science Foundation of China (62072113)

Abstract:

Because exact queries on large-scale datasets take a long time to execute, Approximate Query-Processing (AQP) techniques are commonly used in online analytical processing to return results at interactive latency while keeping the query error as low as possible. Existing learning-based AQP methods are decoupled from the underlying data and convert I/O-intensive computation into CPU-intensive computation. However, because computing resources are limited, these methods typically train their models on random data samples. Such samples can miss rare groups entirely, which lowers the prediction accuracy of the model. To address this problem, this paper proposes a hybrid Stratified Sampling-based Sum-Product Network (SSSPN) model and designs an AQP framework based on it. Stratified samples effectively avoid the loss of rare groups, and a model trained on them is substantially more accurate. In addition, for dynamically updated data, this paper proposes an adaptive model-update strategy that allows the model to detect data shift promptly and update itself adaptively. Experimental results show that, compared with sampling-based and machine-learning-based AQP methods, the model reduces the average relative error by approximately 18.3% on real datasets and 2.2% on synthetic datasets; when data are dynamically updated, both its accuracy and its query latency remain stable.

Key words: Approximate Query-Processing (AQP), Sum-Product Network (SPN), stratified sampling, data shift, adaptive update
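
The rare-group problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's SSSPN model or its sampling procedure; it only contrasts uniform random sampling with a simple per-group stratified sample on a hypothetical skewed table. The table contents, the group names, the sample sizes, and the helper names `uniform_sample`, `stratified_sample`, and `groups_in` are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def uniform_sample(rows, k, rng):
    """Draw k rows uniformly at random, without replacement."""
    return rng.sample(rows, k)


def stratified_sample(rows, key, per_group, rng):
    """Draw up to per_group rows from every group, so no group is dropped."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    sample = []
    for members in groups.values():
        take = min(per_group, len(members))
        sample.extend(rng.sample(members, take))
    return sample


def groups_in(rows, key):
    """Return the set of group labels present in rows."""
    return {key(row) for row in rows}


def by_group(row):
    """Group key: the first field of a (group, value) row."""
    return row[0]


# Hypothetical skewed table: three large groups and two rare groups.
rng = random.Random(0)
table = (
    [("A", rng.random()) for _ in range(5000)]
    + [("B", rng.random()) for _ in range(3000)]
    + [("C", rng.random()) for _ in range(2000)]
    + [("rare1", rng.random()) for _ in range(3)]
    + [("rare2", rng.random()) for _ in range(2)]
)

uni = uniform_sample(table, 50, rng)
strat = stratified_sample(table, by_group, 10, rng)

print("groups in table:            ", sorted(groups_in(table, by_group)))
print("groups in uniform sample:   ", sorted(groups_in(uni, by_group)))
print("groups in stratified sample:", sorted(groups_in(strat, by_group)))
```

With only 5 rare rows among 10,005, a uniform sample of 50 rows will usually contain neither rare group, whereas the stratified sample contains every group by construction. A stratified sample over-represents small groups, so any aggregate estimated from it has to be reweighted by each group's true size; that step is omitted here.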