
Computer Engineering ›› 2009, Vol. 35 ›› Issue (5): 56-58. doi: 10.3969/j.issn.1000-3428.2009.05.019

• Software Technology and Database •

CURE Algorithm-based Inspection of Duplicated Records

SHI Nian-yun, ZHANG Jin-ming, CHU Xi

  1. (College of Computer and Communication Engineering, China University of Petroleum, Dongying 257061)
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2009-03-05 Published: 2009-03-05

Abstract: The Clustering Using Representatives (CURE) algorithm is improved and applied to the detection of similar duplicate records. A pre-sampling step is proposed that determines how similar duplicate records are distributed in the record set, which improves the accuracy of the subsequent random sampling. The selection of representative points is also improved: a strategy based on a distance influence factor selects representative points according to the density of the data set while still admitting meaningful boundary points as representatives, which makes the selection more reasonable. Theoretical analysis and experiments show that the method achieves high accuracy while maintaining execution efficiency.

Key words: similar duplicate records, sampling algorithm, representative points
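
For orientation, representative-point selection in the standard CURE algorithm can be sketched as below. This is not the distance-influence-factor strategy proposed in the paper, whose definition is not given in the abstract. The function names, the parameters `c` and `alpha`, and the use of numeric tuples with Euclidean distance are illustrative assumptions; real records would need a field-level similarity measure instead.

```python
import math


def centroid(points):
    """Mean of a non-empty list of equal-length numeric tuples."""
    dim = len(points[0])
    return tuple(sum(p[i] for p in points) / len(points) for i in range(dim))


def select_representatives(points, c=4, alpha=0.5):
    """Standard CURE step (assumed baseline, not the paper's variant).

    Picks up to c well-scattered points from one cluster, then shrinks
    each of them toward the cluster centroid by the fraction alpha.
    """
    center = centroid(points)
    chosen = []  # indices of the scattered points picked so far
    for _ in range(min(c, len(points))):
        best_idx, best_dist = None, -1.0
        for i, p in enumerate(points):
            if i in chosen:
                continue
            if not chosen:
                # First pick: the point farthest from the centroid.
                d = math.dist(p, center)
            else:
                # Later picks: the point farthest from those already chosen.
                d = min(math.dist(p, points[j]) for j in chosen)
            if d > best_dist:
                best_idx, best_dist = i, d
        chosen.append(best_idx)
    # Shrinking damps the effect of points lying on the cluster boundary.
    return [
        tuple(x + alpha * (m - x) for x, m in zip(points[i], center))
        for i in chosen
    ]


if __name__ == "__main__":
    cluster = [(0, 0), (4, 0), (0, 4), (4, 4), (2, 2)]
    print(select_representatives(cluster, c=4, alpha=0.5))
```

With `alpha=0` the scattered points are returned unchanged, and with `alpha=1` they all collapse onto the centroid. The abstract indicates that the paper replaces this purely distance-based choice with one that also accounts for data density.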
