Detection of Approximately Duplicated Records Based on Data Grouping Matching

Computer Engineering, 2010, Vol. 36, Issue (12): 104-106. doi: 10.3969/j.issn.1000-3428.2010.12.036

ZHOU Li-juan, XIAO Man-sheng

(College of Science and Technology, Hunan University of Technology, Zhuzhou 412008)

Published: 2010-06-20 Online: 2010-06-20
About the authors: ZHOU Li-juan (born 1974), female, lecturer, M.S.; main research interests: database technology, data mining, intelligent computing. XIAO Man-sheng, associate professor, M.S.
Funding: Scientific Research Fund Project of Hunan Provincial Higher Education Institutions (09C339)

Abstract

Approximately duplicated records are one of the key factors degrading data quality in multi-source data integration. To improve both identification accuracy and detection efficiency, this paper proposes a data grouping algorithm based on optimized selection of record feature attributes. The method first computes the variance of each feature attribute to determine that attribute's weight. Following the idea of data grouping, it then uses the attributes with larger weights to split the data set into disjoint small data sets, and finally identifies approximately duplicated records within each small data set using a fuzzy matching algorithm. Theoretical analysis and experimental results show that the method achieves relatively high identification efficiency and detection accuracy, and that it effectively addresses the identification of approximately duplicated records in data integration.

Key words: multi-source data sets, attribute optimization, data grouping matching, approximately duplicated records
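
The three steps named in the abstract (variance-based attribute weighting, splitting into disjoint groups, fuzzy matching inside each group) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the weighting formula, the grouping rule, or the fuzzy matching algorithm, so everything below is an assumption. In particular, the sketch assumes min-max scaling before taking the variance, groups records by the exact value of the single highest-weight attribute, uses the standard-library `difflib.SequenceMatcher` as a stand-in for fuzzy matching, and picks an arbitrary threshold of 0.85. The field names (`id`, `name`, `zip`, `dept`) and the sample records are hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations
from statistics import pvariance

# Hypothetical sample data, for illustration only.
SAMPLE_RECORDS = [
    {"id": 1, "name": "Zhang Wei", "zip": 412008, "dept": 1},
    {"id": 2, "name": "Zhang Wei.", "zip": 412008, "dept": 1},
    {"id": 3, "name": "Li Na", "zip": 100080, "dept": 1},
    {"id": 4, "name": "Li Naa", "zip": 100080, "dept": 1},
    {"id": 5, "name": "Wang Fang", "zip": 250000, "dept": 2},
]


def attribute_weights(records, numeric_attrs):
    """Weight each attribute by the variance of its min-max-scaled values.

    Assumption: scaling to [0, 1] makes variances comparable across attributes.
    """
    variances = {}
    for attr in numeric_attrs:
        values = [float(r[attr]) for r in records]
        lo, hi = min(values), max(values)
        span = hi - lo
        scaled = [(v - lo) / span if span else 0.0 for v in values]
        variances[attr] = pvariance(scaled)
    total = sum(variances.values())
    if total == 0:  # every attribute is constant: fall back to equal weights
        return {a: 1.0 / len(numeric_attrs) for a in numeric_attrs}
    return {a: v / total for a, v in variances.items()}  # weights sum to 1


def group_by_attribute(records, attr):
    """Split records into disjoint groups keyed by the exact value of attr."""
    groups = defaultdict(list)
    for r in records:
        groups[r[attr]].append(r)
    return groups


def similarity(a, b, text_attrs):
    """Mean SequenceMatcher ratio over the text attributes (stand-in matcher)."""
    ratios = [
        SequenceMatcher(None, str(a[k]).lower(), str(b[k]).lower()).ratio()
        for k in text_attrs
    ]
    return sum(ratios) / len(ratios)


def detect_duplicates(records, numeric_attrs, text_attrs, threshold=0.85):
    """Return the grouping attribute and the id pairs flagged as duplicates."""
    weights = attribute_weights(records, numeric_attrs)
    key_attr = max(weights, key=weights.get)  # attribute with the largest weight
    pairs = []
    for group in group_by_attribute(records, key_attr).values():
        # Pairwise comparison happens only inside a group, not across groups.
        for a, b in combinations(group, 2):
            if similarity(a, b, text_attrs) >= threshold:
                pairs.append((a["id"], b["id"]))
    return key_attr, pairs
```

Calling `detect_duplicates(SAMPLE_RECORDS, ["zip", "dept"], ["name"])` selects `zip` as the grouping attribute and compares names only within each `zip` group. The saving comes from replacing one all-pairs comparison over the whole data set with all-pairs comparisons over much smaller groups. One limitation of exact-value grouping in this sketch is that two duplicates whose grouping attribute itself differs (for example, a mistyped zip code) land in different groups and are never compared.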

CLC number: TP311

ZHOU Li-juan, XIAO Man-sheng. Detection of Approximately Duplicated Records Based on Data Grouping Matching[J]. Computer Engineering, 2010, 36(12): 104-106.

http://www.ecice06.com/CN/Y2010/V36/I12/104

