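Abstract: By introducing common neighbors, the two-tiered correlation clustering algorithm achieves good resolution correctness and noise immunity. However, because the algorithm runs in two tiers, it has no advantage in time efficiency. To address this, the algorithm is implemented within the MapReduce framework so that distributed computation improves its execution efficiency. Auxiliary files are designed to reduce memory consumption and intermediate data output, block update rules for the distributed environment are given, and the second-tier block adjustment algorithm is rewritten so that data requiring real-time updates are first computed together and then processed according to the more salient correlation features. Experimental results show that, compared with the TT and DTT algorithms, the method preserves resolution accuracy while substantially improving time efficiency.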
Keywords:
correlation clustering,
MapReduce model,
entity resolution,
big data,
data integration,
distributed computing
Abstract: Correlation clustering is a basic method for entity resolution. By introducing the concept of common neighborhood into the correlation clustering problem, the two-tiered correlation clustering method outperforms traditional approaches in terms of accuracy and noise immunity. However, this method is not time efficient because of its two-tiered architecture. To improve its efficiency in a big data environment, this paper proposes a two-tiered correlation clustering method based on MapReduce. Some auxiliary files are designed to decrease memory consumption and intermediate data output. New correlation rules for adjusting blocks are proposed, and the adjustment algorithm in the bottom tier is redesigned so that block adjustment can be processed according to the most salient correlation features. Experimental results show that the resolution method is both accurate and time efficient for big data.
Key words:
correlation clustering,
MapReduce model,
entity resolution,
big data,
data integration,
distributed computing
CLC number: