Computer Engineering (计算机工程)

• Advanced Computing and Data Processing

Entity Resolution Method Based on MapReduce and Two-tiered Correlation Clustering

WANG Ning, HUANG Min

  1. (School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China)
  • Received: 2014-09-02 Online: 2015-09-15 Published: 2015-09-15
  • About the authors: WANG Ning (born 1967), female, associate professor, Ph.D.; main research interests: Web data integration, big data management, and data mining. HUANG Min, master's student.
  • Funding:
    National Natural Science Foundation of China (61370060); Natural Science Foundation of Jiangsu Province (BK2011454).

Abstract: Correlation clustering is a basic method for entity resolution. By introducing the concept of common neighbors into the correlation clustering problem, the two-tiered correlation clustering algorithm outperforms traditional approaches in terms of resolution accuracy and noise immunity. However, because it executes in two tiers, the algorithm is not time efficient. To improve its efficiency in a big data environment, this paper implements the algorithm in the MapReduce framework and uses distributed computing to speed up its execution. Auxiliary files are designed to reduce memory consumption and intermediate data output. Block update rules for the distributed environment are given, and the block adjustment algorithm of the second (bottom) tier is rewritten so that data requiring real-time updates are first computed together and blocks are then adjusted according to the more salient correlation features. Experimental results show that, compared with the TT and DTT algorithms, the method preserves resolution accuracy while substantially improving time efficiency.

Key words: correlation clustering, MapReduce model, entity resolution, big data, data integration, distributed computing
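
The abstract names two ingredients, common neighbors and a MapReduce-style execution, without showing how they might fit together. The following is only a minimal single-process sketch of that combination, not the algorithm from the paper: it counts, for each pair of records, how many neighbors they share in a similarity graph, using a map step that emits pair keys and a reduce step that groups them. The function names (`map_phase`, `reduce_phase`, `common_neighbor_counts`), the record identifiers, and the example graph are all hypothetical and chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def map_phase(adjacency):
    """Emit ((a, b), n) for every pair a < b that shares the neighbor n."""
    for node, neighbors in adjacency.items():
        # Any two neighbors of `node` have `node` as a common neighbor.
        for a, b in combinations(sorted(neighbors), 2):
            yield (a, b), node


def reduce_phase(pairs):
    """Group emitted values by pair key and count the shared neighbors."""
    grouped = defaultdict(set)
    for key, neighbor in pairs:
        grouped[key].add(neighbor)
    return {key: len(shared) for key, shared in grouped.items()}


def common_neighbor_counts(adjacency):
    """Run the map step, then the reduce step, in one process."""
    return reduce_phase(map_phase(adjacency))


# Hypothetical similarity graph: r1..r4 are made-up record identifiers,
# and an edge means two records were judged similar.
example_graph = {
    "r1": {"r2", "r3"},
    "r2": {"r1", "r3"},
    "r3": {"r1", "r2", "r4"},
    "r4": {"r3"},
}

counts = common_neighbor_counts(example_graph)
```

In a real MapReduce job the grouping inside `reduce_phase` would be done by the framework's shuffle stage; here it is simulated with a dictionary so the sketch stays self-contained.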

CLC Number: