
Computer Engineering

• Advanced Computing and Data Processing •

A Data Blocking Algorithm Based on Label Propagation

RAN Detong, YOU Hongliang

  1. (China Defense Science and Technology Information Center, Beijing 100142, China)
  • Received: 2016-08-03 Online: 2017-09-15 Published: 2017-09-15
  • About the authors: RAN Detong (born 1992), male, master's student, whose main research interests are information resource development and services and database applications; YOU Hongliang, senior engineer.

Abstract:

Data blocking helps reduce the computational complexity of Entity Resolution (ER) on large-scale data, but existing algorithms struggle to balance effectiveness and efficiency. To strike a better balance between the two, this paper proposes a data blocking algorithm based on label propagation. The algorithm estimates record similarity from the number of identical terms shared between records and uses label propagation to detect potential approximately duplicated records, which reduces the time complexity. Experimental results on common test data sets show that the proposed algorithm effectively improves the F-Measure value and reduces running time, making data blocking practical on large-scale data.

Key words: data quality, data cleaning, Entity Resolution (ER), approximately duplicated record, data blocking, Label Propagation Algorithm (LPA)
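
The two steps named in the abstract (estimating similarity from shared terms, then grouping records by label propagation) can be illustrated with a minimal sketch. This is not the authors' implementation: the whitespace tokenization, the `min_shared` threshold, the weighted voting, the smallest-label tie-break, and all function names are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_graph(records, min_shared=2):
    """Link records that share at least `min_shared` distinct terms (assumed rule)."""
    tokens = [set(r.lower().split()) for r in records]
    # Inverted index: term -> ids of the records that contain it.
    index = defaultdict(list)
    for rid, terms in enumerate(tokens):
        for term in terms:
            index[term].append(rid)
    # Count shared terms only for pairs that co-occur under some term,
    # which avoids comparing every record with every other record.
    shared = Counter()
    for ids in index.values():
        for a, b in combinations(ids, 2):
            shared[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), weight in shared.items():
        if weight >= min_shared:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def propagate_labels(n, graph, max_iter=20):
    """Asynchronous label propagation with shared-term counts as vote weights."""
    labels = list(range(n))  # every record starts with its own label
    for _ in range(max_iter):
        changed = False
        for node in range(n):
            votes = Counter()
            for neighbor, weight in graph.get(node, {}).items():
                votes[labels[neighbor]] += weight
            if not votes:
                continue  # isolated record keeps its own label
            best = max(votes.values())
            if votes.get(labels[node], 0) == best:
                continue  # current label is already among the strongest
            # Deterministic tie-break (an assumption): take the smallest label.
            labels[node] = min(l for l, v in votes.items() if v == best)
            changed = True
        if not changed:
            break
    return labels


def block_records(records, min_shared=2):
    """Return blocks as sorted lists of record ids that ended with the same label."""
    graph = build_graph(records, min_shared)
    labels = propagate_labels(len(records), graph)
    blocks = defaultdict(list)
    for rid, label in enumerate(labels):
        blocks[label].append(rid)
    return sorted(blocks.values())
```

Records that end up in the same block would then be the only candidates passed to a detailed pairwise comparison, which is where blocking saves work relative to comparing all pairs.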

CLC number: