Memory Similarity Join Algorithm Based on Index

doi:10.3969/j.issn.1000-3428.2016.01.004

Computer Engineering (计算机工程)

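DONG Mingxiu^a,b, WANG Peng^a,b, WANG Yang^a,b, LI Qiuhong^a,b, WANG Wei^a,b

(a. School of Computer Science; b. Shanghai Key Laboratory of Data Science, Fudan University, Shanghai 201203, China)

Received: 2014-12-19 Online: 2016-01-15 Published: 2016-01-15
About the authors: DONG Mingxiu (born 1988), female, master's student, whose research interests are time series correlation queries and distributed computing; WANG Peng, associate professor and doctoral candidate; WANG Yang and LI Qiuhong, doctoral candidates; WANG Wei, professor and doctoral candidate.
Funding:
National Natural Science Foundation of China (61103009); Shanghai Science and Technology Commission Big Data Special Fund (13511504800).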

Abstract

Keywords: similarity join, disk, query, memory, index, partition

Abstract: In traditional similarity join algorithms, the data partitioning phase and the refinement (exact computation) phase are independent of each other. During refinement, all pairs of data within each partition must be compared, which incurs a large number of comparisons. To address this problem, this paper designs a new in-memory index, the DistanceTree, and proposes an in-memory similarity join algorithm based on it. The algorithm distributes data into partitions according to the underlying data distribution, ensuring that pairs of data with a certain degree of similarity are assigned to the same or adjacent partitions, and it preserves the results computed during the partitioning phase through the positional information of the tree nodes. By reusing these results, the refinement phase only needs to compare pairs of data in the same or adjacent leaf nodes. Experimental results show that the DistanceTree-based similarity join algorithm runs 2 to 3 times faster than the TOUCH algorithm and scales better.
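The idea of comparing only data that land in the same or adjacent partitions can be illustrated with a minimal sketch. This is not the paper's DistanceTree, whose construction is not described on this page; it swaps in a plain uniform grid with cell side `eps` as a stand-in partitioning scheme. The function names `brute_force_join` and `grid_join`, the Euclidean distance, and the threshold parameter `eps` are all assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from math import dist, floor


def brute_force_join(points, eps):
    """Baseline: test every pair of points (quadratic number of comparisons)."""
    return {
        (i, j)
        for i in range(len(points))
        for j in range(i + 1, len(points))
        if dist(points[i], points[j]) <= eps
    }


def grid_join(points, eps):
    """Stand-in for index-based partitioning (NOT the DistanceTree itself).

    Points are hashed into grid cells of side eps. Two points within
    distance eps differ by at most eps in every coordinate, so their cell
    indices differ by at most 1 per dimension: only the same cell and its
    adjacent cells need to be examined.
    """
    dim = len(points[0])
    cells = defaultdict(list)
    for idx, p in enumerate(points):
        cells[tuple(floor(c / eps) for c in p)].append(idx)

    offsets = list(product((-1, 0, 1), repeat=dim))
    result = set()
    for cell, members in cells.items():
        for off in offsets:
            neighbor = tuple(c + o for c, o in zip(cell, off))
            # .get avoids creating empty cells while iterating over the dict
            for j in cells.get(neighbor, ()):
                for i in members:
                    if i < j and dist(points[i], points[j]) <= eps:
                        result.add((i, j))
    return result
```

Both functions return the same set of index pairs; the grid version simply skips pairs whose cells are not neighbors. The number of neighboring cells grows as 3^d with the dimension d, which is one reason a data-driven index may be preferable to a fixed grid for high-dimensional data.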

CLC number: TP391

DONG Mingxiu, WANG Peng, WANG Yang, LI Qiuhong, WANG Wei. Memory Similarity Join Algorithm Based on Index[J]. Computer Engineering.

https://www.ecice06.com/CN/Y2016/V42/I1/18


