CURE Algorithm-based Inspection of Duplicated Records

Computer Engineering, 2009, Vol. 35, Issue (5): 56-58. doi: 10.3969/j.issn.1000-3428.2009.05.019

SHI Nian-yun, ZHANG Jin-ming, CHU Xi

(College of Computer and Communication Engineering, China University of Petroleum, Dongying 257061)

Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-03-05 Published: 2009-03-05

Abstract

The Clustering Using Representatives (CURE) algorithm is improved and applied to the detection of similar duplicate records. A pre-sampling step is proposed that effectively determines how similar duplicate records are distributed in the record set, which improves the accuracy of random sampling. The representative-point selection method is also improved: a selection strategy based on a distance influence factor chooses representative points according to the density of the data set while also selecting suitably meaningful edge points as representatives, which makes the selection more reasonable. Theoretical analysis and experiments show that the method achieves high accuracy while maintaining execution efficiency.

Key words: similar duplicate records, sampling algorithm, representative points
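The abstract does not give the formula for the distance influence factor, so the sketch below only illustrates the baseline step that the paper modifies: representative-point selection in the standard CURE algorithm (pick well-scattered points, then shrink them toward the cluster centroid). The function names, the parameters `c` and `alpha`, and the assumption that records have already been mapped to numeric feature vectors are illustrative and not taken from the paper.

```python
import math


def centroid(points):
    """Component-wise mean of a list of equal-length numeric tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def cure_representatives(points, c=4, alpha=0.5):
    """Baseline CURE step (not the paper's improved strategy).

    Picks up to c well-scattered points from one cluster, then shrinks
    each of them toward the cluster centroid by the fraction alpha.
    """
    n = len(points)
    mean = centroid(points)
    # First scattered point: the one farthest from the centroid.
    chosen = [max(range(n), key=lambda i: math.dist(points[i], mean))]
    while len(chosen) < min(c, n):
        # Next point: the one farthest from its nearest already-chosen point.
        rest = [i for i in range(n) if i not in chosen]
        chosen.append(
            max(rest, key=lambda i: min(math.dist(points[i], points[j]) for j in chosen))
        )
    # Shrinking damps the effect of outliers on the cluster boundary.
    return [
        tuple(x + alpha * (m - x) for x, m in zip(points[i], mean))
        for i in chosen
    ]


def cluster_distance(reps_a, reps_b):
    """CURE measures cluster distance as the closest pair of representatives."""
    return min(math.dist(a, b) for a in reps_a for b in reps_b)


if __name__ == "__main__":
    cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
    print(cure_representatives(cluster, c=4, alpha=0.5))
```

In this baseline every scattered point is shrunk by the same fraction `alpha`. According to the abstract, the paper's strategy instead weighs candidates by a distance influence factor so that both cluster density and meaningful edge points affect the choice.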

CLC number: TP301.6

Cite this article: SHI Nian-yun, ZHANG Jin-ming, CHU Xi. CURE Algorithm-based Inspection of Duplicated Records[J]. Computer Engineering, 2009, 35(5): 56-58.

http://www.ecice06.com/CN/Y2009/V35/I5/56
