
Computer Engineering ›› 2022, Vol. 48 ›› Issue (9): 28-36. doi: 10.19678/j.issn.1000-3428.0064406

• Hot Topics and Reviews •

Approximate Nearest Neighbor Search Based on Neighbor Graphs with Parameter Adaptation

GAN Hongnan1, ZHANG Kai2

  1. School of Software, Fudan University, Shanghai 200438, China;
    2. School of Computer Science, Fudan University, Shanghai 200438, China
  • Received: 2022-04-08  Revised: 2022-05-19  Published: 2022-06-20
  • About the authors: GAN Hongnan (born 1996), male, M.S. candidate; his main research direction is approximate nearest neighbor search. ZHANG Kai, Ph.D., associate professor.
  • Funding:
    National Natural Science Foundation of China (62072113).

Abstract: Approximate Nearest Neighbor Search (ANNS) algorithms based on neighbor graphs typically organize the vectors in a database into a neighbor graph structure and obtain the Approximate Nearest Neighbors (ANN) of a query vector according to user-specified search parameters. To improve the search efficiency of graph-based ANNS algorithms under a given recall requirement, a parameter-adaptation method named AdaptNNS is proposed. First, AdaptNNS samples vectors from the database and clusters the sampled vectors. Second, it uses the cluster centroids, acting as a nearest-neighbor classifier, to extract query-workload features. Finally, it concatenates the query-workload features with different recall targets to form the input features for training a Gradient Boosting Decision Tree (GBDT) model. During query processing, AdaptNNS derives the model input from the current queries and the application-specified recall target, predicts the optimal search parameters with the GBDT model, and thereby improves the throughput of the ANNS algorithm. Experiments on the Text-to-Image, DEEP, and Turing-ANNS datasets with the DiskANN and HNSW algorithms show that, at the same target recall, AdaptNNS increases throughput by up to 1.3 times compared with the Baseline method, and thus searches approximate nearest neighbors more efficiently.

Key words: Approximate Nearest Neighbor Search (ANNS), neighbor graph, parameter adaptation, clustering, Gradient Boosting Decision Tree (GBDT)

CLC number:
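
The adaptation pipeline described in the abstract (sample and cluster the database vectors, characterize the query workload through the cluster centroids, then train a GBDT that maps workload features plus a recall target to search parameters) can be illustrated with the minimal sketch below. It is not the authors' implementation: the synthetic data, the centroid-histogram workload features, scikit-learn's GradientBoostingRegressor standing in for the GBDT, and HNSW's ef candidate-list size as the predicted parameter are all assumptions made for illustration.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)

# 1. Sample vectors from the (synthetic) database and cluster the sample.
database = rng.normal(size=(10_000, 64)).astype(np.float32)
sample = database[rng.choice(len(database), 2_000, replace=False)]
kmeans = KMeans(n_clusters=16, n_init=10, random_state=0).fit(sample)

def workload_features(queries):
    # Assign each query to its nearest centroid (a 1-NN classifier over the
    # centroids) and describe the workload by the assignment histogram.
    labels = kmeans.predict(queries)
    return np.bincount(labels, minlength=kmeans.n_clusters) / len(queries)

# 2. Training data: (workload features ++ recall target) -> search parameter.
#    In the paper these labels would come from offline tuning runs; here they
#    are random placeholders because no tuning data is available.
X_train, y_train = [], []
for _ in range(200):
    queries = rng.normal(size=(256, 64)).astype(np.float32)
    for recall in (0.90, 0.95, 0.99):
        X_train.append(np.append(workload_features(queries), recall))
        y_train.append(rng.integers(32, 512))  # placeholder "optimal ef"
gbdt = GradientBoostingRegressor().fit(np.array(X_train), np.array(y_train))

# 3. Query time: build the input from the live workload and the requested
#    recall, then let the GBDT predict the search parameter.
live_queries = rng.normal(size=(256, 64)).astype(np.float32)
target_recall = 0.95
features = np.append(workload_features(live_queries), target_recall)
ef = int(gbdt.predict(features.reshape(1, -1))[0])
print(f"predicted ef for recall {target_recall}: {ef}")

In a deployment one would replace the placeholder labels with values measured offline, such as the smallest search parameter that reaches each recall target for each sampled workload; the same scheme would apply to DiskANN's search-list size.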