
Computer Engineering ›› 2023, Vol. 49 ›› Issue (11): 284-292, 301. doi: 10.19678/j.issn.1000-3428.0066222

• Development Research and Engineering Application •

Aggregated Query Interval Estimation Method Based on Depth Generative Model

Jun FANG1,2, Xiaodong XUE1,2, Yunliang ZHOU1,2

  1. School of Information, North China University of Technology, Beijing 100144, China
  2. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China
  • Received: 2022-11-10 Published: 2023-11-15 Online: 2023-02-08
  • About the authors:

    Jun FANG (born 1976), male, associate researcher, Ph.D.; main research interests: big data management and distributed data processing

    Xiaodong XUE, master's student

    Yunliang ZHOU, master's student

  • Funding:
    International (Regional) Cooperation and Exchange Program of the National Natural Science Foundation of China (62061136006)

Abstract:

Most current approximate query methods answer a query with a single estimated value. Such point estimation is simple, but the estimate carries error. Interval estimation methods must compute over a large number of samples, which leads to high query latency and makes them difficult to apply widely in practice. Model-driven approximate query techniques have some advantage in efficiency, but their query results lack reliability guarantees. To address this, an approximate query method that combines data sampling with machine learning algorithms is proposed. A deep generative model is used to improve query efficiency, and queries are answered with an interval estimate instead of a point estimate; that is, a relatively reliable interval is produced from the query results on multiple samples. First, an improved Generative Adversarial Network (GAN) model learns the data distribution, so that multiple samples can be generated rapidly without accessing the dataset. Next, a massively parallel processing architecture distributes the computing tasks to carry out sample generation and query execution. Finally, the query results are returned to the user. Experimental results show that the Normalized Confidence Interval Coverage (NCIC) of the aggregate query interval estimates produced by the proposed method exceeds 85%. In query experiments with the aggregate function COUNT and selectivity below 0.03, the NCIC of the method on the ROAD and PM2.5 datasets is 13.9% and 14.8% higher, respectively, than that of the random sampling method. Although query latency increases compared with the baseline method, it still meets the requirements of common applications.

Key words: approximate query, generative model, parallel computing, interval estimation, sampling
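
The idea of answering a COUNT query with an interval built from the results on several generated samples can be illustrated with a minimal sketch. It is not the authors' implementation: the generator `draw_sample` is a hypothetical stand-in that draws uniform random values rather than samples from a trained GAN, the interval is a simple normal approximation over the per-sample estimates (the abstract does not state which interval construction is used), and all names, sizes, and the threshold are assumptions chosen for illustration.

```python
import numpy as np


def draw_sample(rng, n):
    # Hypothetical stand-in for a trained generative model:
    # returns n uniform values in [0, 1) instead of GAN output.
    return rng.random(n)


def count_where(sample, threshold):
    # COUNT(*) WHERE value < threshold, evaluated on one sample.
    return int(np.count_nonzero(sample < threshold))


def interval_from_samples(per_sample_counts, sample_size, population_size, z=1.96):
    # Scale each per-sample count up to the full population size.
    scale = population_size / sample_size
    estimates = np.asarray(per_sample_counts, dtype=float) * scale
    center = estimates.mean()
    # Assumed construction: normal-approximation interval around the
    # mean of the per-sample estimates (z = 1.96 for roughly 95%).
    half_width = z * estimates.std(ddof=1) / np.sqrt(len(estimates))
    return center - half_width, center + half_width


def estimate_count(num_samples=20, sample_size=10_000,
                   population_size=1_000_000, threshold=0.02, seed=0):
    rng = np.random.default_rng(seed)
    # Each sample is generated and queried independently, so in principle
    # these iterations could be spread across parallel workers.
    counts = [count_where(draw_sample(rng, sample_size), threshold)
              for _ in range(num_samples)]
    return interval_from_samples(counts, sample_size, population_size)


if __name__ == "__main__":
    low, high = estimate_count()
    print(f"COUNT interval estimate: [{low:.0f}, {high:.0f}]")
```

With the stand-in generator, the predicate has selectivity 0.02, so the interval should fall near 20,000 for the assumed population of 1,000,000 rows. Using more samples narrows the interval at the cost of more generation and query work.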